METHOD FOR LOSSLESS ENCODING OF IMAGE DATA BY APPROXIMATING 
LINEAR TRANSFORMS AND PRESERVING SELECTED PROPERTIES FOR 
IMAGE PROCESSING 



CROSS-REFERENCES TO RELATED APPLICATIONS 

This application claims the benefit of provisional U.S. Application Serial 
No. 60/250,829, filed December 1, 2000; and provisional U.S. Application Serial 
No. 60/250,850, filed December 1, 2000, under 35 U.S.C. § 119. 

FIELD OF THE INVENTION 

This invention relates to image processing and, more particularly, to a method for 
approximating image processing transforms with invertible integer-to-integer transforms. 
BACKGROUND OF THE INVENTION 

Contemporary image processing methods apply image data transforms in a vast 
array of applications. In many cases, the transformation process causes a loss of data and 
frequently requires complex image data compression and decompression circuitry for 
image transmission and restoration. Lossless image transformation processes are 
generally perceived as being computationally expensive and tend not to be used as often 
as lossy transformation processes. Among the most widely used image data transforms 
are the wavelet transform and the color transform. 

The processing of image data also requires sophisticated methods for 
manipulating data stored in multi-dimensional matrices. At times, such data can be 
highly position dependent and any processing on such data may require certain types of 
highly localized processing given the unique nature of the interactions between image 
data at neighboring locations. These matrices may also be comprised of different types of 
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numerical values. Frequently, such matrices include real number entries that are applied 
to an input data set that may also be comprised of real numbers or other types of 
numerical data. In the event integer input data is provided, the output image data 
produced by a transformation matrix comprised of real numbers may not necessarily 
provide the most accurate mapping of input data values to output data values. 
Furthermore, the reliable mapping of entire classes of numerical data may require the 
processing of bounded and unbounded length input data vectors. 

A reliable image data transformation process is needed that can be used to map 
integer input data to integer output data without the possible loss of data associated with 
the transformation of such input data by transformation matrices including real number 
entries. This process must be capable of being applied to the most widely used 
transforms for input data vectors that are bounded and unbounded in length. The present 
invention is directed to providing computer-implemented methods for generating integer- 
to-integer transformation matrices that are approximations to known linear transformation 
matrices. 

SUMMARY OF THE INVENTION 

In accordance with the present invention, a method for lossless encoding of image 
data involving the approximation of an invertible linear transformation by an integer-to- 
integer transformation is disclosed. The invertible linear transformation may be applied 
to vectors of a fixed length or to vectors of unbounded length which are of bounded 
amplitude. An approximation to an invertible linear transformation is at times necessary 
when the input vectors are integers and the entries in a matrix including the 
transformation are real numbers. The invention discloses methods for factoring a matrix 
including an invertible linear transformation into simpler matrices of a form which allows 
them to be approximated directly. 

Among the types of matrices that may be generated by the approximation methods 
are elementary matrices, permutation matrices and diagonal matrices, each of which are 
simpler to process in an image processing system. It is an object of the present invention 
to enable the factored matrices to have certain useful properties. Among the various 
useful properties preserved by the computer-implemented matrix factorization methods 
are their ability to maintain correct values at certain inputs that a given linear 
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transformation already maps to integers. The methods can also be used to preserve local 
interactions between coordinates in an input subspace that directly influence neighboring 
data in an output subspace. The methods disclosed in this invention can be applied to 
well-known transforms such as the RGB^YC B C R color transform and the 9-7 wavelet 
transform. 

In accordance with an aspect of the present invention, a method is described for 
generating a first plurality of output data vales by transforming a plurality of input data 
values using a computer. The first plurality of output data values approximate a second 
plurality of output data values. The second plurality of output data values are generated 
by applying a linear transform to the first plurality of input data values. The linear 
transform may be a finite one-dimensional linear transform or a 9-7 wavelet transform. 
This method comprises at least one step, the step being a rearrangement of at least one 
data value in a plurality of current input data values, a negation of at least one value in the 
plurality of current input data values, a modification at least one data value in the 
plurality of current input data values, or a successive combination of one or more of these 
steps. 

In accordance with another aspect of the invention, a method for factoring a linear 
transformation matrix storing a finite one-dimensional linear transform is disclosed. The 
method requires the LU-decomposition of the linear transformation matrix, the generation 
of four matrices from the LU-decomposed matrix, the generation of a third matrix and a 
signed permutation matrix, the generation of a permuted linear transformation matrix, the 
computation of an upper triangular matrix from the permuted linear transformation 
matrix, the factoring of the permuted linear transformation matrix, and the generation of 
matrix factors from the linear transformation matrix. 

In yet another aspect of the present invention, a computer-implemented method is 
disclosed for generating a sequence of matrix factors for use in generating a matrix that 
approximates the transformation matrix that stores a wavelet transform. This method 
involves the application of at least one row reduction operation to the transformation 
matrix and the generation of a sequence of matrix factors from the transformation matrix 
after application of the at least one row reduction operation. 
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As will be readily appreciated from the foregoing description, the invention 
provides several new methods for the lossless encoding of image data by use of an 
approximation matrix that can map integer input data to integer output data that is the 
same as that mapped to by a transformation matrix storing a known linear transform. 
BRIEF DESCRIPTION OF THE DRAWINGS 

The foregoing aspects and many of the attendant advantages of this invention will 
become more readily appreciated as the same become better understood by reference to 
the following detailed description, when taken in conjunction with the accompanying 
drawings, wherein: 

FIGURE 1 is a flowchart illustrating a process for generating matrix factors in 
accordance with the present invention; 

FIGURE 2 is a flowchart illustrating a process for transforming a plurality of 
input data values in accordance with the present invention; 

FIGURE 3 is a flowchart illustrating a process for generating matrix factors for a 
finite-dimensional linear transform in accordance with the present invention; 

FIGURE 4 is a flowchart illustrating a process for generating matrix factors for a 
wavelet transform stored in a transformation matrix in accordance with the present 
invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 
While the preferred embodiment of the invention has been illustrated and 
described, it will be appreciated that various changes can be made therein without 
departing from the spirit and scope of the invention. 

INTRODUCTION 

In various signal processing applications, one must apply an invertible linear 
transformation. The input vectors are often given as integers, while the transformation 
has real entries. In some situations, it is convenient to approximate the linear 
transformation by another (possibly nonlinear) invertible transformation which maps 
integer vectors to integer vectors. 

There are actually two versions of this problem. The first version exists when the 
given vectors have a fixed (probably small) length n. Given a linear map from R" onto 
R" specified by an nxn matrix A, and a (probably nonlinear) bijection <p is needed from 
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Z" to Z" which is close to A in the sense that || Ax - <px|| is bounded for xeZ". (We will 
use the Euclidean norm, but one could use other norms as well. Note also the use of the 
words "length" for the number of coordinates in a vector, and "norm" for the Euclidean 
magnitude of the vector.) Equivalently, we have the standard integer lattice Z" and a 
linearly transformed lattice AZ", and we want to find a bijection ip from the 
transformed lattice to the original lattice which moves points as small a distance as 
possible (so ip = (p o A -1 ). Please note that we are using the same symbol A for the linear 
transformation and for its associated matrix. 

In the second version of the problem, the input vectors are signals which can have 
unbounded length, but are of bounded amplitude (i.e., the values appearing as coordinates 
of the vector are bounded). Both versions of the problem were described in co-pending 
and co-filed prior provisional U. S. Application Serial Nos. 60/250,829 and 60/250,850, 
each of which are hereby incorporated by reference. Several methods providing practical 
solutions to these problems will be discussed in this document. The image processing 
methods described herein can be performed in various software systems. 

For both versions of the problem, the goal is to find an integer bijection <p 
approximating the given transformation so that the approximation error is bounded over 
all inputs, preferably with a small bound. (Of course, there are other properties one 
would like <p to have, such as easy computability of (p and <p~ l .) As we will see, this is 
possible only if the determinant is ±1 in the fixed-length case; there is a similar 
restriction in the unbounded-length case. Even then it is not obvious that one can get the 
error to be bounded. In the fixed-length case, one could try some sort of greedy 
algorithm which initially maps each point in the first lattice to the nearest point in the 
second lattice and then goes through a correction process to resolve collisions (two points 
mapped to the same point) and gaps (points in the second lattice not images of any point 
in the first lattice), but the corrections might get worse and worse as more points are 
processed, and it is not at all clear that one can get a bounded-error bijection this way. 
INVERTEBLE INTEGER-TO-INTEGER MAPPINGS 
We start by showing that, in the fixed-length case, a necessary condition for the 
existence of a bounded-error integer approximation <p to the linear transformation A is 
that det A = ±1. Suppose that such a (p exists with error bound 8 , and let <ip = (poA~ l . 
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Then for a large positive integer M, the points in the transformed lattice AZ" within the 
cube [-M,M] n are mapped by ip to standard lattice points in the slightly larger cube 
[-M -8,M + 8] n , and all standard lattice points in the smaller cube [-M +8,M +S] n 
are reached in this way. So the number of transformed lattice points in the cube [-MMY 1 
must be (2M +1)" + O^M n ~ lS J for large M; this implies that the determinant of the 

transformed lattice (i.e., det A) is ±1. 

We may as well assume that the determinant is 1, because, if it is -1, we can negate 
a row of the matrix to change the determinant to +1. An integer approximation for the 
modified matrix easily yields an integer approximation for the original matrix (just negate 
the specified coordinate at the end). 

The main approach we will use for integer approximations is to divide and 
conquer: if we have a linear transformation with no obvious suitable integer 
approximation, then we factor the matrix into parts which we do know how to 
approximate. The composition of these approximations of parts will be a suitable 
approximation to the entire transformation. To see this, first consider the two-factor case: 
if A = A X A 2 and we have <p x and q> 2 approximating and A 2 so that || Ap - ^xj < Q 
and ||A 2 x-<p 2 x|| < C 2 for all x, then (p x o<p 2 approximates A, because 

\\A A>* " W2*|| ^ I AA> X - A<P2*\\ + \\A<P2* ~ <Pi<P2*\\ 
< lAlH^x - <p 2 x|| + fA^x -(Pi<p 2 4 

We can iterate this: if A = A 2 A 2 • • • A k where each A,- can be approximated by an 
integer mapping with error bound Q , then A can be approximated by the composition of 
these integer mappings with error bound 

(1.1) C 1+ ||A 1 ||C 2+ ||A||||A 2 ||C3 + -.. + |A 1 ||||A 2 ||...||A,_ 1 ||C,. 

If one does the whole computation here at once, rather than iteratively, one gets a 
slightly improved form of the bound: 

(1-2) Q+IK||C2+||A 1 A 2 ||C3 + ... + ||A 1 A 2 .-A i _ 1 ||C yt . 
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Since the goal here is to produce invertible integer approximations to invertible 
linear transformations, we will also be interested in error estimates for the inverse 
transform: we will want a bound on ||^ _1 y -<P _1 y|| over all integer vectors y. This 

bound will not in general be the same as the bound for the forward transform, but it is 
closely related: for any such y, if we let x = <p _1 y , then 

I A _1 y - ^ _1 y I = Vx - x|| 

= |U~V X - A _1 Ax 

(1.3) 11 11 

^A-'WWAx-cpxW 

A similar computation gives ||Ax-<px||<||A||||A _ V-<P _1 y||, so 

|a-V-*-V|M4TV-H- 

Formulas such as (1.2) indicate that, if multiple factorizations of a given 
transformation are available, then the ones with fewer factors are likely to have better 
error bounds (assuming that the error bounds and norms for the factors in the 
factorizations are similar). 

A special factor that will occur frequently in factorizations and is easy to handle is 
a permutation matrix which merely rearranges coordinates or bands. In fact, we can 
generalize this by allowing some of the l's in the permutation matrix to be changed to -l's 
(negating some coordinates or bands). This will be needed because permutation matrices 
can have determinant -1 and we usually want to restrict to matrices of determinant 1. We 
will often refer to such signed permutation matrices as 'permutation' matrices. 

Such a matrix is normally a "free" factor: it is approximable by an integer 
mapping with error 0, and its norm is 1, so it does not inflate the errors from the other 
factors in formula (1.1). In fact, since this matrix gives an isometry both of R" and of 
Z" , it cannot have any effect on the error bounds from a factorization if it occurs first or 
last in that factorization. If it occurs in the middle, then it might have a slight effect on 
the error by affecting the relation between the two factors it lies between. 

Another type of factor which will be fundamental in the following is an 
elementary matrix, which differs from the identity matrix only at a single off diagonal 
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entry. Applying such a matrix has the effect of adding a multiple of one coordinate to 
another coordinate. 

If A is a 2 x 2 elementary matrix, say 



A = 



1 <*n] 
0 1 J' 



x l +a 12 x 2 

x 2 



then let 



pj^l j = ^l+«12^2>j = ^l+(«l2^2>j 

where (y) is y rounded to an integer in a consistent way (say, round to nearest 
with half-integers rounded upward) so that, for any integer n and real number y, 
y,{n + y) =n + (y). (Consistency is needed only if we want to think of the mapping (p as 
"apply A and then round all coordinates to integers." If we are willing to forget about 
(xi+ a\ixi) and go straight to .^+(^2X2), then any rounding function can be used.) 
Then we have || Ax - <j?x|| < 1 / 2. And (p is invertible: if 



then x 2 = y 2 and x l = y 1 -(ai 2 x 2 ). (Note that y x ~{a l2 x 2 ) might occasionally 
be different from y l + (-ai 2 x 2 ) . In other words, the inverse of the integer approximation 
of A need not be the same as the integer approximation of the inverse of A, because the 
rounding is done slightly differently. However, for elementary matrices the differences 
should be rare, occurring only when the number to be rounded is equal to or very near a 
half-integer.) 

We will see in the section entitled "Larger Matrices" that unit triangular matrices 
(i.e., lower or upper triangular matrices whose diagonal entries are all 1) are as suitable as 
elementary matrices for the purpose of obtaining integer approximations. 

One can think of the usual full Gaussian elimination process as factoring a matrix 
into elementary matrices, simple permutation matrices (for pivoting), and a diagonal 
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matrix. For most applications of such factorizations, the elementary matrices are the ones 
requiring attention, while the permutations and diagonal matrices are trivial and can be 
ignored. The present situation is an exception; elementary matrices (and permutation 
matrices) are easy to handle directly, but diagonal matrices are not. We will investigate 
the 2 x 2 diagonal matrices extensively in the next section. 

The linear transformations and the corresponding integer approximations may 
change the range that the coordinates vary over — an approximation which maps integers 
to integers need not map 16-bit integers to 16-bit integers. It is easy to determine the new 
ranges after the linear transformation: if the transformation is given by A = (ay) and 

input coordinate number j is bounded in absolute value by bj for each j, then output 
coordinate number i is bounded in absolute value by " Ja^-j&y. (Similar bounds can 

be computed if the input coordinates are restricted to intervals not symmetric around 0.) 

Since the integer approximation is supposed to be within a fixed distance of the 
linear transformation, one can easily adjust these bounds to get bounds for the 
approximation. (However, intermediate results may not lie within these bounds; one may 
need to compute ranges for the individual factor matrices in order to bound these.) 
THE 2x2 DIAGONAL MATRIX 

As mentioned in the previous section, diagonal matrices, which are trivial for 
most applications, are quite nontrivial when it comes to integer approximations; here we 
will factor them into matrices which can be approximated directly. We may assume that 
the given diagonal matrix has determinant 1 . Furthermore, if we have an n x n diagonal 
matrix of determinant 1, we can factor it into simpler diagonal matrices of determinant 1 
each having only two nontrivial diagonal entries: the n = 3 case is 





0 0^ 




'd x 


0 


0" 


/j 


0 o N 


0 


d 2 0 




0 




0 


0 


d x d 2 0 


0 


0 d 3j 




0 


0 


1 


0 


0 d 3 



and larger matrices are handled similarly. So we can concentrate on the 
determinant- 1 2x2 diagonal matrix 
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(2.1) 



fa 0 
D -(o a- 

We may assume or> 0, since otherwise we can just pull out the scaling factor 

1 0 
0 -1 

which is "free" (i.e., just negate everything at the end). 
We can factor D into four elementary matrices: 



1 r 1 OVt -rcC l \n 0 
0 ills lL 1 \\-sa 1 



o 

D 

a 

Sl where r and 5 are chosen so that rs + 1 = a, or 



(2-2) 



1 (o l](-sa l) o 



|7j where + 1 = a: 1 Any such factorization leads to a bounded-error integer 

G3 10 approximation for D. A plausible choice for r and 5 would be to "balance" the factors by 
requiring |r| = |s«|, but it is not clear that this will minimize the error bound. The 

factorizations appearing in the prior art are (2.1) with r = a-a 2 and (2.2) with s = -1. 
Or we can factor D into three elementary matrices and a 'permutation' matrix: 



(2.3) 



15 (2.4) 



•f- M rn 



0 -lVl cT 1 Yl 0 

1 0 I o 1 I -« 1 



One can modify these so as to have the 'permutation' matrix appear at a different 
place in the factorization; this will not affect the error bounds we obtain here. 
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The factorizations (2.3) and (2.4) are closely related: one can get (2.4) from (2.3) 
by interchanging the two coordinates throughout and replacing a with or 1 . Note that 
"interchanging the two coordinates" is equivalent to conjugation by the reverse-diagonal 
matrix 




Similarly, one gets (2.2) from (2.1) by interchanging the two coordinates, 
replacing or with cr 1 , and interchanging r and s. 

Note that these error bounds apply to D in isolation. If the diagonal matrix is just 
one part of a larger transformation, then when the parts are put together one can often 
combine factors such as adjacent elementary matrices with the nonzero entry in the same 
location; this normally will reduce the resulting error. 

The factorizations (2.1)-(2.4) of D can easily be inverted to give factorizations of 
D" 1 into elementary factors. As noted earlier, the integer approximations for these 
inverse factorizations are not quite the same as the inverses of the integer approximations 
for the direct factorizations. However, in the 2x2 case the differences only show up 
when a result lies exactly halfway between two integers and must be rounded to one of 
them (assuming rounding to the nearest integer). Since the analysis here will not depend 
on how such choices are made, we can do error analysis of the inverse factorizations to 
get error bounds for the inverse transformation. 

It turns out that we do not need to do any additional work to get the bounds for the 
inverse transformation here. The inverse of (2.3) is: 

° 1= (« J (o i ] (« J (-i o) 

Conjugating this by the matrix 
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(an isometry which will not affect error bounds and which commutes with the 
given diagonal matrix) gives the formula 



This is exactly the same as (2.3) except that the 'permutation' occurs at the left 
(the end) instead of the right (the beginning). It follows that any error bound we obtain 
for the forward transform for (2.3) will also be a bound for the error in the inverse 
transform. 

The same reasoning shows that any forward transform error bound for (2.4) is also 
an inverse transform error bound. The inverse of (2.1) turns out to be (2.2) with a, r, s 
replaced with of* 1 , sot, ra~^ respectively, so we can obtain inverse transform error, 
bounds for (2.1) from forward transform error bounds for (2.2); similarly, forward 
transform error bounds for (2.1) give inverse transform error bounds for (2.2). 

One can work out that 



Using this and the combination formula (1.1), one can compute an error bound for 
the integer mappings coming from each of the factorizations (2.1)-(2.4). (Each of the 
2x2 elementary matrices has an error bound of 1/2 unless it has integer entries, in which 
case it has error bound 0.) However, the formulas for these bounds are rather messy; for 
instance, the error bound for (2.3) is 
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As a useful example, let us consider the special case a — V2 . In this case the 
error bounds for (2.3) and (2.4) become respectively 

^ (2 + ( V3 + 1)( V2 + 2)) * 2.83 195 12300735069, 

i(2 + 4l + S ) = 2.573 1321849709862. 

For (2.1) and (2.2) with a = y/2 , the error bound C is still a messy function, this 
time of the parameter r (from which s is computable: s = (a-l)/r or s = [a~ X -lj/r . 

Note that we may assume r > 0, since negating r and 5 yields another valid factorization 
without changing the norms of the factors. As previously mentioned, a plausible choice 
of r is r = V2-V2 -0.7653668647301795 for (2.1) and 
r = - 1 » 0.6435942529055826 for (3.2); these yield error bounds of about 
3.4167456510765178 and 3.1662988562277977, respectively. Numerical optimization 
on r yields the following values: 

for (2.1): r « 0.4811840837330633, C « 3.2403445213547604; 

for (2.2): r - 0.8035642481527317, C = 3.1303571980433588. 
Using more detailed calculations, we can get error estimates more precise than 
those obtained from norms of matrices. Look at a particular elementary factor matrix Af . 
If the nonzero off diagonal entry of Af is in the upper right, and if cp-x is the result of 
rounding A^x to integer coordinates, then for any integer vector x t the error vector 
(p{Ki - A-Ki will have the form (e,- , 0) , where \e t | < 1/ 2. Combining this for all factors in 
a product A = A x A 2 • \ , we get 

d = qj^ ■ ■ ■ (p k x - Ax = d x + Ajd 2 + AiA 2 d 3 + • • • + A X A 2 ■ ■ ■ A^d* , 

where 



dj = qtfPi +\(p i +2---(p k x- Aj<Pj +l(pi+2---(p k x 



N 5 

ru 



is of the form (^-,0) if A,- is elementary with upper right entry nonzero, 
dj =(0,ej) if A,- is elementary with lower left entry nonzero, and d ? - is the zero vector 
(i.e., term number i in the error bound is omitted) if Aj is an integer matrix. We can now 
find the maximum possible value of |[d|| subject to the constraint that |e z |<l/2 for all i, 
and this will give an error bound for the integer approximation of A. 

For the factorization (2.3), we get 

^2 + e 3 a 



d- 



Q Clearly we maximize ||d|| by letting e 2 ,e 3 ,e 4 all have absolute value 1/2, with e 2 

a 

CP and e 3 having the same sign and e A having the opposite sign. This gives the error bound 



||d||<I(l + a-')^. 



Because of the known relation between (2.3) and (2.4), we get the error bound 
for (2.4) by replacing a with a" 1 in the error bound for (2.3): 
||d|<i(l + a)Vl+er 2 . 

These two bounds are actually equal. In the case a = -j2, the common value of 
the bound is i S (2 + V2 ) - 1 .4783978394802332. 

For (2.1) we have 

[ e l + e2 r + e 3 a 
e 2 + e?,s + e 4 a~ 

This leads to case distinctions based on the signs of a - 1, r, and s. As before, we 
may assume that r > 0; this means that a - 1 and s have the same sign, since rs = a - 1. 

If a> 1, and hence s > 0, then clearly |d|| is maximized when the errors e t are all 

1/2 or all -1/2. So the error bound is: 
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Ml 



r + a + l)) 2 +(r(a+l)+a(a-l)) 2 



One can actually work out the critical points of this function of r (holding or fixed) 
by solving a fourth-degree polynomial equation, but it is probably more convenient to 
find the optimal value of r numerically. 
5 If a < 1 (so s < 0), then the choice of signs for the errors e t is much less clear; 

aligning them to make one component of d maximal will cause cancellation in the other 
component. One must consider the various possibilities to see which yields the longest 
t A error vector for given values of a and r. (It is easy to see that e t should be ±1/2, because 

the most distant point from the origin on a line segment is always one of the two 

k-J 

p 10 endpoints; only the signs of the numbers e i are unknown.) 

The situation for (2.2) is reversed: for a < 1 one can get a single formula for the 
maximal error, but for a > 1 one must look at various cases. 

Again consider the example a - 42 . For (2.1) we have a single error formula 
and can proceed directly to numerical optimization to find that the best value for r is 
15 about 0.5789965414556075, giving an error bound of about 1.9253467944884184. For 
(2.2), the error bound is the maximum of four separate formulas; it turns out that this is 
minimized where two of the formulas cross each other, at r = 
V2V2 - 2 / 2 - 0.4550898605622273, , and the error bound is 
Vl2 + 18V2/4 * 1.5300294956861884. 
20 We still have to consider special values of r and 5 where one of the four matrices 

in (2.1) or (2.2) is integral, and hence the corresponding e t becomes 0. Among these are 

several cases giving an error bound matching the value i>/3 (2 + V2~)from (2.3), and two 

cases which give even better bounds: putting r = a/2 in (2.2) gives an error bound of 
^21 + 8V2 / 4 » 1 .421 1286997265756 , and putting r = l-l/V2 in (2.2) gives an error 
25 bound of 4$ + 2V2 / 2 = 1.3614526765897057 . 

Even this does not exhaust the error analysis for (2.1)-(2.4). The error bounds 
obtained above are not sharp, because the errors e t are not actually independent of each 
other. For instance, in the computation for (2.3), e 2 is not independent of e 3 and e 4 : one 
can show that e 2 + «e 3 - e 4 must be an integer. (If we start with an integer vector x = (xl, 
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x2), then the second component of <p 4 x is b = x 2 + x^cc + e 4 and the second component of 
(p 2 cp 3 (p 4 x is b' = xiOt + e 3 a + e 2 ; these are both integers, so b' - b + x 2 - e 2 + cce^ -e 4 is an 
integer.) Using this, we can get the following error bound: 

|^(a+l) 2 +cT 2 (2[a/2]-l) 2 if col, 
^(a- 1 + l) 2 + (2r«/2l-l) 2 ifa<l. 

Replace or with or 1 to get the corresponding error bound for (2.4). For a -^2 the 
error bounds for (2.3) and (2.4) are Vu + Wf/ 4 = 1.2578182623839374 and 
V4 + 2a/2/2= 1.3065629648763765, respectively. 

Such improvements for (2.1) and (2.2) are available only for special values of r 
and s, and depend highly on the specific form of these numbers and of a. For instance, in 
(2.2) for a = and r = 1 - 1/ V2 , this method gives an error bound of >/4 + 2>/2 / 2 , the 
same as for (3.4) above. 

These final error bounds for (2.3) and (2.4) are provably sharp when a is 
irrational, as is the above bound for the instance r = 1-1/V2 of (2.2). So (2.3) appears 
to give the best results among these methods when 

If a is rational (and, in the case of (2.1) and (2.2), the parameters r and s are also 
rational), then the errors from the integer approximation are periodic in both coordinates, 
so one can perform a finite computation to get the exact error bound for a particular 
factorization. 

One can obtain other integer approximation methods for rational a by 
constructing a finite configuration on a rectangle in the plane and extending 'periodically' 
to get the full mapping. Even for irrational a, where a 'periodic' solution is not available, 
one can still use integer approximation methods completely different from those obtained 
via factorization into elementary matrices. 

LARGER MATRICES 

As noted before, one can use Gaussian elimination to factor an n X n matrix of 
determinant 1 into elementary matrices, permutation matrices (or 'permutation' matrices 
of determinant 1), and a diagonal matrix of determinant ±1; we may assume that the 



determinant of the diagonal matrix is 1, because we can transfer a negation to one of the 
'permutation' factors. A diagonal matrix of determinant 1 can be factored into simpler 
diagonal matrices which have only two entries different from 1, these entries being 
reciprocals of each other; these simpler matrices can then be factored as in (2.1)-(2.4). So 
we know that any matrix of determinant 1 can be factored into integer-approximable 
factors. 

But this process would yield a very large number of factors. The number of 
factors can be drastically reduced if we work with a family of factor matrices more 
general than the elementary matrices but still allowing easy bounded-error integer 
approximations. The matrices we will use here are the unit triangular matrices, which are 
(lower or upper) triangular matrices whose diagonal entries are all 1. (Note that any 
elementary matrix is unit triangular.) FIGURE 2 illustrates the process followed to 
generate the elementary matrices, permutation matrices and diagonal matrix used in the 
present invention. 

Suppose we have a unit upper triangular matrix 





a l3 • 


" a ln > 


) 1 


a 12 ■ 


■■ a 2n 


) 0 


1 • 


- a 3n 


) 0 


0 


•• 1 



so that 



fxl^ 




' x x + a 12 x 2 + a l3 x 3 + ■ • • + a ln x n ' 


x2 




X2+a 2 3X 3 +--- + a 2n x n 


x3 






< x n ) 







Then we can approximate U by 
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V 




x 2 




x n 





' x 1 + (a l2 x 2 + a 13 x 3 + ■ ■ ■ + a ln x n > 
x 2 + (a 23 x 3 + --- + a 2n x n ) 



This will give \\llx - (px\\ < y/n-l 1 2 for all integer vectors x. And (p is invertible: 

if 



X n _ l = y n -\-(an-l,n x n)> 

X n _ 2 = V„_ 2 -{ a n-2,n-\ x n-l +a n-2,n x n)' 

x \ = Jl - <«12^2 + • • • + a \n x n)- 



Note that (p can be computed in place (output entries overwriting input entries) if 
the output entries are computed in the order y h y 2 ,.~, J„; qr 1 can also be computed in 
place, with the results computed downward from x n to xj. 

Again we find that qr 1 is not the same as the integer approximation to the matrix 
U A (which is also unit upper triangular). The difference here is more substantial than in 
the 2 X 2 case. For instance, if n = 3 and we apply the approximation for U and then the 
approximation for U' 1 to the starting integer vector (x h x 2 , x 3 ), the first coordinate of the 

result will be 

x 1 +(a l2 x 2 +a 13 x 3 ) + (-a l2 x 2 -a u x 3 + a 12 (a 23 x 3 -(a 23 x 3 ))), 
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which is quite likely to be different from x x even without boundary effects in the 
rounding rule. (In fact, the recursion displayed above in the computation of qr l tends to 
result in larger error bounds for the inverse transform than for the forward transform.) 

The same approximation method works for a unit lower triangular matrix; one 
5 merely has to compute the output coordinates in the reverse order. (Again, one can 
convert between upper triangular and lower triangular using conjugation by a reverse- 
diagonal matrix /.) Actually, there are variants where the coordinates are computed in 
any specified order; these are obtained by combining a unit triangular matrix with a 
'permutation' matrix. 

H= 10 Since a general matrix of determinant 1 can be factored into elementary matrices, 

n it certainly can be factored into unit triangular matrices. The main question now is how 

2 many unit triangular factors are required in general. A quick lower bound can be 

%0 obtained by counting degrees of freedom (free parameters). The general matrix of 



y3 



determinant 1 has n 2 -1 degrees of freedom. A unit triangular matrix has In -nj/2 



U 15 degrees of freedom; hence, at least three such factors are needed to handle the general 

? !J 

q matrix (assuming n > 1). 

W Note that a product of unit upper triangular matrices is unit upper triangular, and a 

2 product of unit lower triangular matrices is unit lower triangular. So, in a factorization 

into unit triangular matrices, we may assume that the two types of matrix alternate. 
20 We just saw that a product of two unit triangular matrices is not general enough to 

give an arbitrary matrix of determinant 1. It turns out that the family of matrices that can 
be expressed as a product of two unit triangular matrices, say in the order LU, is an 
interesting one. One of ordinary skill in the art would be capable of expressing matrices 
as a product of two unit triangular matrices since such results are related to the method of 
25 expressing the diagonal entries of the upper triangular factor in a standard LU- 
decomposition as quotients of determinants of leading square submatrices of the original 
matrix. 

Proposition 3.1. Annxn matrix A = (ajj) nn can be expressed in the form LU (L 
unit lower triangular, U unit upper triangular) if and only if for each k<n, the leading k 
30 X k submatrix of A (i.e., the upper leftkxk submatrix of A, or iaipk,h or A * k x k ) has 
determinant 1. 
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Proof. It is easy to see from the special forms of L and U that (LU) t k x k= 
(L r k x k) (U r * x k) for any Jk < n. Since L r £ x £ and £/ r £ x £ obviously have 
determinant 1, (LU) 1 k X k must have determinant 1. 

(<=) Suppose A has the specified leading-submatrix property. If we express the 
unknown L and U as (fy)^ „ and (Cy)„ „ (so = c {i =1,^ = 0 for i < 7, and c t> - = 0 for i > 

j), then LU works out to be 

\ c 12 c 13 c 14 

b 21 1+fc 21 c 12 c 23+^21 c 13 c 24+^21 c 14 

^31 ^32+^l c 12 1 + ^32 C 23 +Z, 31 C 13 c 34 +^32 c 24 +^l c 14 •■• ■ 

% ^42+^41 c 12 ^43+^42 c 23+^41 c 13 1 + fe 43 c 34 +^42 c 24 +^41 c 14 

So we can set and c n so that the entries of LU in the first column (below the 
diagonal) and the first row (right of the diagonal) will match the corresponding entries of 
A. Then we can set b i2 and c 2i so that the remaining off-diagonal entries in the second 
row and column of LU match those of A. Continuing this way, we obtain matrices L and 
U of the required form so that all off-diagonal entries of LU match the corresponding 
entries of A. 

Using the fact that A and LU both have the property that all leading square 
submatrices have determinant 1, we now show by induction on k that the fc'th diagonal 
entry of LU is a kk . Suppose this is true for all k' < k. Then A t k X k and (LU) \ kxk 
agree except possibly at the lower right entry. If we treat a kk as an unknown, then the 
equation det(A t k X k) = 1 is a linear equation for a kk and the coefficient of a kk is 
det(A r (k - 1) x (k - 1)) = 1, so a kk is uniquely determined. The lower right entry of (LU) 
\ kxk satisfies exactly the same equation, so it must be equal to a kk . 
Therefore, A = LU, as desired. 

The right-to-left direction could be proved more briefly as follows: given the 
matrix A, perform the standard LC/-decomposition (no pivoting is needed) to write A as a 
product of a unit lower triangular matrix and a general upper triangular matrix; then, from 
the above, the diagonal entries of the second factor will all be 1. A more explicit proof 
was presented above to show the simplifications that arise in this special case. 
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In fact, given a suitable matrix A = (ay)^„, there is a quite simple algorithm to 
compute matrices L = {b { -) nn and £7 = (c y ) Mn as above. Start by setting xy <- a tj for all f, 
< n, and do: 

for k= 1 to n-1 
5 for j = k+1 to n 

(3.1) for; = k+l ton 

x ij *i/ ~ x ik x kj 

Then we will have x y - = 2> y - for i > 7 and x y = c y - for i < j (and x y - =1). 
By reversing the indices both horizontally and vertically, we see that a matrix can 
10 be written in the form UL (with L and U as above) if and only if all of its lower right 
square submatrices have determinant 1. 

To handle more general matrices of determinant 1, we need more than two 
factors. It turns out that three factors (along with a possible 'permutation') will always 
suffice. In fact, since three factors give more degrees of freedom than we need, we can 
15 be more specific by requiring one of the three unit triangular factors to have a special 
form. 

Proposition 3.2. Let A = (a y )„ n be annXn matrix of determinant 1 such that all 
of the submatrices (flj+y)^ for 1 < k < n-1 have nonzero determinant. Then A can be 
written in the form UiLU where U\ and U are unit upper triangular, L is unit lower 
20 triangular, and the only nonzero entries ofUi are on the diagonal or the top row. 

Proof We first find a matrix A' differing from A only in the first row so that all 
leading square submatrices of A' have determinant 1. Let a[i ,a{ 2 ,---,a{ n denote the 
unknown entries of the first row of A'. Then, once we know a{ l ,a{ 2 ,...,a[^_ l , we can 
determine a[ k so that A' t kxk will have determinant 1. This condition is a linear 
25 equation for a{ k and the coefficient of the linear term is ±det (a i+ ij)^_i^.i (define this to 

be 1 for k = 1), which is nonzero by assumption, so there is a (unique) value which works 
for a[ k . So we can proceed from left to right to determine all of the unknown entries of 

A'. 

If we can find a matrix U\ of the required form so that A = L^A', then we can use 
30 Proposition 3.1 to express A' in the form LU for unit triangular matrices L and U, giving 

LEA\18255APDOC "21- 



A = U X UJ as desired. Let l,u 2 ,u 3 ,...,u n denote the entries of the first row of U x . Then 
the unknown values u t must satisfy the equations 

a2\U2+a 3] u z + --- + a nl u n =an-fln, 
«22 M 2 + a 32 M 3 + * * * + a n2 u n = a \2 ~ a \2 > 

a 2n u 2 + a 3n M 3 + • • • + a /m M n = a ln ~ a \n- 



m 

U 
fii 

G 

s ! 

Q 



If we just look at the first n-1 of these equations, then the (n-1) X (n-1) matrix of 

coefficients is (a.+i , which has-nonzero determinant, so there are unique 

V it*-,J } n -.\^ n -\ 

numbers u 2 , ■ ■ ■ , u n satisfying these n-1 equations. This means that the resulting matrix 
Ui will be such that t/jA' agrees with A everywhere except possibly at the upper right 
corner entry. But A and U^A' both have determinant 1, and the cofactor of the upper right 
corner in the computation of these determinants is det l n _ x * 0 » so A and U 1 A ' 

must in fact agree everywhere. 

The special case n = 2 of this proposition is of interest; it states that any 2x2 

matrix 

A= ( a n a l2 

[ a 21 a 22 



of determinant 1 with a 2 \ * 0 can be written as a product of three elementary 
15 matrices. It is not hard to work out this factorization explicitly: 



(3.2) 











1 


a 22 _1 ^| 


A = 


"21 


f 1 


:) 




a 2l 




0 1 


{ a 2\ 


K 0 


1 



By transposing everything and reversing the order of the factors, we get a similar 
factorization for a 2 x 2 matrix of determinant 1 with upper right entry nonzero. So it is 
only the diagonal matrices which require four factors as discussed above in the section 
20 devoted to the 2 x 2 diagonal matrix. (It is not hard to show that a non-identity 2x2 
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diagonal matrix cannot be written as a product of elementary factors without using at 
least two upper factors and two lower factors - for instance, if only one lower factor is 
used, then the nonzero lower left entry of this factor will be the same as the lower left 
entry of the product.) 

5 A given matrix of determinant 1 might not satisfy the hypothesis of Proposition 

3.2, but this problem can be handled by a small modification of the matrix. Given any 
nonsingular matrix, one can permute the rows so as to get all leading square submatrices 
to have nonzero determinant. (Expand the determinant of the matrix by minors on the 
last column; since the determinant is nonzero, one of these minors has nonzero 

M 10 determinant. So we can swap rows so that the leading (n-1) x (n-1) submatrix has 

Q 

Q nonzero determinant. Now proceed recursively.) Then we can move the last row up to 

the top (and negate it if necessary to restore the determinant to +1) to get a matrix 
4 J3 satisfying the hypotheses of Proposition 3.2. Therefore: 

Theorem 3.3. Any matrix of determinant 1 can be factored in the form \\UyLU , 
15 where IT is a signed permutation matrix, U x and U are unit upper triangular, L is unit 
lower triangular, and the only nonzero entries of Ui are on the diagonal or the top row. 

There is a version of Theorem 3.3 which applies to any nonsingular matrix A: one 
can factor A in the form Yl&iLU where IT is a 'permutation' matrix, L is unit lower 
triangular, U is unit upper triangular, and Ui is a matrix which differs from the identity 
20 matrix only in its first row. (So Ui is like U\ except that the upper left entry of Ui may 
differ from 1.) To see this, first note that the argument just before Theorem 3.3 applies 
here to find a 'permutation' TT such that A = IT A where A is such that the submatrices 
specified in Proposition 3.2 have nonzero determinant. Let d be the determinant of A , 
and let A' be A with its first row divided by d. Apply Proposition 3.2 to factor A' in the 
25 form U X LU, and let U x be U\ with its first row multiplied by d; then we have 
A = II U\LU as desired. 

By tracing through the proofs leading up to Theorem 3.3, one can extract an 
algorithm for factoring a matrix of determinant 1 in the specified form, but it will not be a 
good algorithm. In particular, instead of using trial and error with subdeterminants to 
30 choose a 'permutation' IT, we would like to have a method 300 that works faster and 
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produces more numerically stable results. It turns out that the standard LU- 
decomposition algorithm provides just what is needed. 

FIGURE 3 illustrates a method 300 for generating matrix factors for a given 
matrix A of determinant 1. Gaussian elimination can be performed on this matrix using 
elementary operations and permutations on the rows only (partial pivoting) to reduce the 
matrix to upper triangular form 310. This means that we get the equation UA = LDU , 
where ft is a permutation, L is unit lower triangular, and DU is upper triangular 320 (it 
could be factored further into a diagonal matrix D and an unit upper triangular matrix 
U , but we will not need that here). Note that (LDU) t kxk = (L t kxk) (DU 
r k X k) and the latter two matrices have nonzero determinant (the determinant of DU 
r k x k is the product of its diagonal entries, which is nonzero because the product of all 
of the diagonal entries of DU is DU = (det (nA.))/ (det Z) = ±1 . 

So now we can take UA transfer the bottom row to the top, and negate this row if 
necessary so that the resulting matrix A will have determinant 1; then A is a 'permuted' 
version of A (step 350) which satisfies the hypotheses of Proposition 3.2 . 

Let a = det ft be the sign of the permutation given by n . Then the top row of A 
is the bottom row of ftA multiplied by (-l) n+1 a, and we have A = ITA, where ft is II 
with its bottom row moved to the top and multiplied by (-l) n+1 o (step 340). Now we can 
write A = n A , where IT = FT 1 = tl T . Note that ft and n are 'permutation' matrices. 

Once we have the Proposition 3.2 decomposition U Y LU of A, we will have 
factored A into the form UU^LU, as desired. We will now see that knowing the 
decomposition LDU of ITA makes it much easier to compute the matrices t/j, L, and U. 

The matrix A has the form 



where the numbers a u are the bottom row of UA (possibly negated) (step 330). 
The modified matrix A' from the proof of Proposition 3.2 will have the form 
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«11 «12 a 13 ••• a l« 
{UA)\(n-l)xn 

where the numbers a^- are to be chosen so that the leading square submatrices of 
A' all have determinant 1 (step 350). 

One obtains DU from HA by performing elementary row operations as specified 
5 by L ; each of these operations adds a multiple of an earlier row to a later row. If one 
performs the same operations (shifted down one row) to the lower n-1 rows of the matrix 
A', one obtains the matrix 

o r , , 

!:: «n «i2 a i3 •■• 

p! y DU\(n-\)xn 

m 

Since these operations again only add multiples of earlier rows to later rows, they 
L 10 are still valid row operations when restricted to any leading square submatrix of the 
W matrix, so they do not change the determinants of these submatrices. So if we find values 

for a[i so that the leading square submatrices of A" all have determinant 1, then the 

leading square submatrices of A' will also have determinant 1 . 

Let Vy for \<i,j<n be the entries of the matrix DU ; then v y - = 0 for i >j. Also, 
15 let d k =v n v 22 ---v kk (the product of the first k diagonal entries of DU ; this is equal to 
the determinant of the leading kxk submatrix of DU or of Y1A). Then one can derive 
the following formula for the desired values a[ k : 

(3.3) <4=E(-1)'' +1 ^- 

This can be re-expressed in various ways: since v kk I d k =\l d k _ x , one can 
20 combine the last two terms into ±(v M>k -V)l d k _ x and, instead of computing the products 
d k , one can write the formula in Horner form 

4k =—\ v lk -— v 2k -7~( v 3* — 0 
Vlll V22{ v 33 
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Once we have A', it is easy to factor it into the form LU, as described earlier. It 
now remains to find the matrix C/j so that A = C/jA'. As noted in the proof of 
Proposition 3.2, this requires solving a system of n-1 linear equations in n-1 unknowns 
u 2 ,uo > ,---,u n , and the matrix of coefficients for this system is the transpose of the lower 
left (n-1) x (n-1) submatrix of A (step 360). But this is just ((ITA) r (n-1) X (n-1)) 7 ", and 
the known factorization of UA into two triangular matrices immediately gives such a 
factorization for this matrix: 

((UA) r (n-l)x(n-l)) T =(DU t (n-1) X (n-\)) T (L \ (n-1) X (n-l)) T . 

Using this, we can easily solve for the unknown values u 2 , thus 
completing our desired factorization of A. 

In summary, the method 300 for factoring the determinant- 1 matrix A into the 
formritfjLt/is: 

(1) Use a standard L ^-decomposition algorithm (Gaussian elimination with partial 
pivoting) to find fl , L, and DU so that DA. = LDU . Keep track of the number 
k of row interchanges performed during this process, and let a = (-l) n+l+k . 

(2) Compute UA . (Perhaps this will be done during step (1).) 

(3) Multiply the (unique nonzero entry in the) bottom row of U by a , move this 
bottom row up to the top (moving all the other rows down by 1), and take the 
transpose (i.e., invert the permutation) to get II. 

(4) Let a be the bottom row of flA multiplied by a . 

(5) Compute the numbers o^, a 12 _ _,a ln from DU according to formula (3.3), and let 
a' be the row vector ( a' n , a' n a[ n ). 

(6) Using standard backsolving techniques for triangular matrices (but reversed), find 
the row vector u satisfying the equation uLDU = a - a'. Let U^ be an n x n 
identity matrix with the second through n'th entries in its first row replaced by the 
first through (n-l)'th entries of u. 
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(7) Form the matrix A' consisting of the row a' followed by the first n -1 rows of IIA 
(step 370). Apply (3.1) to A' to compute the entries of the matrices L and U. 
Note that the last entry in the row vectors a and u will not be used and need not 
be computed. Also note that the nontrivial numbers in the matrices U h L, and U can 
easily be packed into a single nxn matrix (with one number to spare). 

The form YlUiLU is only one possible form for a factorization of a given matrix A 
of determinant 1 (step 380); there are many other options for factoring A into unit 
triangular matrices and a 'permutation' matrix. For instance, one could factor A in the 
form UL n UL, where L n is unit lower triangular with nonzero off -diagonal entries only on 
the n'th row. (To get this, reverse the coordinates of A, factor in the form YlUiLU, and 
reverse again. In other words, conjugate by the reverse-diagonal matrix /.) Another 
possibility is the form LUL 1 !!, where L 1 is unit lower triangular with its nonzero entries 
in the first column. (Transpose A, factor in the form YlUiLU, and transpose again, 
reversing the order of factors.) Yet another possibility is to use full pivoting rather than 
partial pivoting in the initial LU-decomposition, leading to a factorization A=YlUiLUn 2 

with two 'permutation' matrices. 

The form U x is particularly suitable for integer approximation purposes, because 

the integer approximation for this factor requires only one coordinate to be rounded; thus, 
the error bound for this factor is 1/2, as opposed to y/n-1/2 for the general unit 
triangular matrix. The same applies to the form L„, but not to the form L 1 . 

But it is good to have as many options as possible, so one can look for a 
factorization that gives the best error bounds for the integer approximation as a whole, 
just as described in the section entitled "The 2x2 Diagonal Matrix." 

PRESERVING PARTICULAR LATTICE POINTS 

In some cases the linear transformation A already sends some integer lattice points 
to integer lattice points. This may be a fundamental property of the transformation, in 
which case it will be highly desirable to have the approximating integer map (p match A 
exactly on these particular points. An example of this situation is presented in the next 
section. 

One particular case of this is handled automatically by the factorization shown in 
the section entitled "Larger Matrices." Suppose that we have Ae x = e h where e Y is the 
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elementary vector with first entry 1 and remaining entries 0. This is equivalent to A 
having first column equal to e x . Then we have A(ke{) = ke x for all integers k, and we 
would like to have (pik^) = kei also. This turns out to be the case: 

Proposition 4.1. Any matrix A of determinant 1 such that Aej = ej can be 
5 factored in the form TlUiLU, where 13 is a signed permutation matrix, U\ and U are unit 
upper triangular, L is unit lower triangular, the only nonzero entries of U j are on the 
diagonal or the top row, and the integer approximation (p to A resulting from this 
factorization satisfies <p(i»i) = ke x for all integers k. 

Proof. Follow the algorithm shown in the section entitled "Larger Matrices." The 
K 10 first step is to use Gaussian elimination with partial pivoting to obtain the expression YIA 

0 = LDU . But the initial matrix A already has its first column in the desired form, so the 

01 elimination will leave the first row alone and process the remaining rows in order to 
'JJ handle columns 2 through n. Therefore, we get ffex, = e 1; and the related matrix n will 
i| satisfy U(ke 2 ) = ke 1 (where e 2 has a 1 in the second position and 0's elsewhere). And the 
J\ 15 matrix A remaining to be factored has first column 

W The entry in position (1 , 2) of A (call it a n ) becomes the entry in position (2, 2) of 

Ta A . When the matrix A' is computed in the next step, its first column is e l + e 2 , and the 

second entry in row 1 comes out to be a' n =a l2 -l. We then get u 2 = -1, and the first 

column of the matrix L is also &i + &2- 
20 So we get the following when applying the matrix A in factored form YlUiLU to 

the vector fa^: U(ke{) = ke h Uke^ = k^ + e 2 ), Ui(Mfii + e 2 )) = ke 2 , and n^) = ke±. 

In the corresponding integer approximation, each step of this process is an integer vector 

anyway, and hence is not altered by rounding. Therefore, we get (fKkei) = ke x for all 

integers k. 

25 Other cases where the matrix A happens to map certain integer vectors to integer 

vectors will probably not be preserved exactly by this integer approximation. However, 
if there is a particular integer vector one is interested in preserving, one may be able to 
apply a preliminary integral linear transformation to move this vector to e l before 
factoring. For instance, suppose that the linear transformation A maps kl to ke h where 1 

30 is the vector with all entries equal to 1. Then we can write A as AA , where A is a simple 
integer matrix of determinant 1 which maps 1 to e h Then we have Ae 2 = e h so we can 
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factor A as above to get a factorization n^Lf/A of A yielding an integer approximation 
(p which sends kl to &ej. 

EXAMPLE: THE flGtf -> rc s C K MATRIX 
As an example, we consider a transformation for conversion of colors presented in 
standard red-green-blue coordinates. Here we will consider only linear changes of 
coordinates, ignoring nonlinear visual effects, which are not relevant for the purposes 
below. A popular coordinate system used for these purposes is the YC B C R coordinate 
system, described in the International Telecommunications Union standards document 
ITU-R BT.601. 

Coordinate systems such as YC B C R may be more desirable for image transmission 
and/or compression, because they decrease wasteful correlations between the three 
coordinates (brighter parts of an image will tend to have higher values for all three 
coordinates) and because coordinate systems in which the most important part of the 
signal (brightness or something like it) is separated out allow different amounts of 
bandwidth to be used for the different coordinates. These purposes would appear 
incompatible with the goal of invertibility; however, it is often desirable for a 
compression or transmission system to be able to operate in either lossless mode or a 
lossy compressed mode, so it is not unreasonable to ask for a lossless transformation from 
RGB to YC B C R . 

The RGB YC B C R conversion is actually a family of linear transformations; a 
particular member of this family is specified by giving weights a R , a G , a B (positive 
numbers summing to 1) for the R, G, and B components. The matrix corresponding to 
these weights is 



2-2a B 
1 
2 



2-2a B 



2-2a„ 2-2a„ 



The determinant of this matrix is not 1 but 
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This is not a serious problem, though, because for decorrelation purposes it does 
not matter if a scale factor is applied to the C R and/or C B output components, and the 
scale factors can be allowed for explicitly in differential data rates. (We do not want to 
rescale the Y component, for reasons given below.) We might as well use the same scale 
factor for both of these components. This means that the first step is to pull out a scaling 
matrix 



leaving a matrix 5" 1 A of determinant 1 to factor. 

The Y output component represents the total luminance (perceived brightness) of 
the specified color. In particular, if the input color is a greyscale value with all three 
components equal to the same number k, then the Y component of the output will be k. 
(This is why Y should not be rescaled.) The other two components are orthogonal to the 
black-white axis; they come out to be zero for greyscale input. In other words, we are in 
the situation described at the end of the previous section: for any k, we have A(kl) = ke Y 
(and hence (S" 1 A) (fcl) = ke h since S fixes e^. To ensure that the integer approximation 
map preserves this property, we start by pulling out a factor A on the right such that A has 
determinant 1 and sends 1 to ej. There are many such matrices to choose from; one 

simple one is 
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5 



'l 0 0 N 
0 $ 0 
0 0 p 



(on the left) from A, where 




-1 0 1 ; 

We are now left with a matrix A = 5 _1 AA _1 to which the algorithms from the 
section entitled "Larger Matrices" can be applied. These yield the factorization 

S-'A^UU.LUA 

where 







1 


0" 




( 1 


-1 




n = 


0 


0 


1 


. u,= 


0 


1 


0 




V 1 


0 


0 J 




0 


0 


1 





f \ 


0 


0^ 




f \ a G -\ 


L = 


1 


1 


0 


u = 


0 1 u 




^0 


h 


1 ; 




^0 0 1 



and the entries t t are given by: 

' 2= ~(2-£,)j8 

3+i _ (2-2a B )^a R 
a G 
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A special case of interest is presented by the set of values provided in the ITU-R 
BT.601 standard, which values are as follows: 

a R = 0.299, a G = 0.587, a B = 0. 1 14. 
In this case, the numerical values of the non-integer entries in the above matrices 

are: 

£ = 0.4860860807044331 a G -l = -0.413 

t x « -0.2034584787387865 f 3 - 0.1560024957269925 

f 2 * -0.6814926851476152 f 4 » -0.0420024957269925 

We can now apply the error analysis methods presented earlier to obtain error 
bounds for this transformation. Note that the integer isometry n has no effect on the 
errors and can be ignored. The integer approximations to matrices Ui and L only involve 
one rounding, because these matrices have only one non-integer row each; so the error 
bound for each of these matrices is 1/2, while the error bound for U (which has two non- 
integer rows) is V2 / 2 . The error bound for the integer matrix A is 0. After computing 
the norms 

p t \\ « 1 .63289649077991 18, 
p^W - 1.6965471828414659 , and 



p^Ul = 1.4371952820317682, 
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we can apply (1.2) to get a forward error bound of 2.5160882629800899. Since 
the inverse to the integer approximation is computed differently, we cannot bound its 
error by applying (1.2) directly; instead we compute J(S _1 A) _1 | - 1.8003448245653902 

and apply (1.3) to get an inverse error bound of 4.5298264824059275. 

One gets better bounds by keeping track of the errors from the separate roundings 
as discussed in the section entitled "The 2x2 Diagonal Matrix". Under the worst-case 
assumption that these errors are independent, one gets error bounds of 
1.5472559440649816 for the forward transform and 1.7941398552787594 for the inverse 
transform. 

One can get lower bounds on the error (and thus gauge the accuracy of the 
preceding upper bounds) by testing the approximation on a large collection of sample 
inputs. One can reduce the computation needed here by using the fact that the 
factorization preserves the mapping kl i-» ke 2 . If one applies the approximation to x and 
to x + kl, then at every step the two results will be the same except that the second result 
will have k added to some coordinates; in particular, the rounding errors will be exactly 
the same. Similarly, the inverse to the approximation will give exactly the same errors 
for input vectors y and y + ke v 

For the forward transform, a search through all input vectors (x 1 ,x 2 ,x 3 )e Z 3 with 
|j^-xi|< 33000 and |x 3 -xj < 33000 (only these relative differences matter) yielded a 

largest error of 1.5404029289484810 at the input vector (19352, 0, 20840). For the 
inverse transform, a search through all input vectors (x 1 ,x 2 ,x 3 )e Z 3 with |x 2 |< 33000 
and |x 3 |< 33000 (the value of x x is irrelevant) yielded a largest error of 
1.7905956082490824 at the input vector (8360, 31316, 8995). These examples show that 
the upper bounds given above are either sharp or very close to it. 

There were a number of choices made in the construction of this factorization (the 
form nU x LU, the particular matrices S and A, and so on). Different choices would lead 
to alternative factorizations, some of which might have better error bounds than the 
factorization presented here. 

MORE GENERAL BIJECTIONS 

As mentioned in the section entitled "Introduction", the approximation problem in 
the fixed-length case is equivalent to finding a bijection ip from a transformed lattice 
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AZ n to the standard integer lattice Z n which moves points as small a distance as possible 
(so that the integer mapping (p = ip ° A is a bijection approximating A). One can imagine 
many ways of trying to find such a bijection, ip ; the problem seems to be a combinatorial 
one. 

Given a matrix A of determinant ±1, we know by now that such maps tjj do exist 
so that the errors (the distances that points are moved) are bounded. Each such ij> has a 
supremal error sup xeAZ „ |^x-x|| (we cannot say "maximal error," because it may be that 
there is no single point x for which |^x-x| is maximal). It is natural to ask whether there 
is a bijection ij> which is optimal in the sense that its supremal error is as small as 
possible; it is conceivable that there would be no optimal bijection, because the ultimate 
error bound could be approached but not attained. This turns out not to be the case: 

Proposition 6.1. For any real nxn matrix A of determinant ±1, there is a 
bijection ^ : AZ" ^Z n which is optimal in the sense that sup^. ||^x - x|| is minimal 

over all such bijections. 

Proof. First find an integer approximation ip x with bounded error, and let e x be 
an error bound for tp t . Then we only need to search among bijections with error bounded 
by e x to find an optimal one ip . If £ a is fixed, then there are only finitely many 
possibilities for ipx for any given xe AZ" (i.e., only finitely many standard lattice points 
within distance £ x of x) and only finitely many possibilities for ip''y for any given 
yeZ\ 

Now a standard compactness argument can be used to complete the proof. There 
are several ways to express this argument. One is to note that the space of integer 
approximations tp satisfying the error bound e 1 can be given a metric (let x h x 2 ,... list 
the vectors in AZ" U Z" , and define the distance between distinct approximations ip and 
if/ to be Ilk where k is least so that tpx k * ^'x* or i/^x* * ^''\ ) so that it becomes a 
compact space, and the supremal error is a lower semicontinuous function from this space 
to the real numbers, so it must attain a minimum value. Another is as follows: Let e be 
the infimum of the supremal errors of approximations ip . For any finite sets S cAZ" 
and S'cZ", there are only finitely many ways to partially define an integer 
approximation i> on S and S' (i.e., define ipx for xeS and ip^y for ye 5') so as to 
meet the error bound e 1 ; so there must be one whose partial error bound on this finite 
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configuration is as small as possible. Since one can find complete integer approximations 
with supremal errors as close as desired to s , the partial error bound must be at most £ . 
So, for any finite parts of the domain and range lattices, we can define t[> and ip' 1 on 
these parts so as to attain the error bound e ; and there are only finitely many ways to do 
so. Now we can apply Konig's infinity lemma to put these together to obtain a complete 
integer approximation attaining the bound e , which is therefore optimal. 

In general, the optimal lattice bijection is not unique. Also, this proof is quite 
non-constructive, and it is not clear that the optimal bijection(s) will be implementable or 
describable in any useful way. 

Let us now examine the case where the matrix A has rational entries. Then the 
transformed lattice AZ" will contain many points that are also in the standard lattice Z"; 
in fact, the intersection L = AZ" fl Z" is a full rc-dimensional lattice. (To see this, it is 
enough to get n independent vectors in L; one can do this by taking the n independent 
columns of A and multiplying each by its least common denominator to get an integer 
vector.) This means that the configuration of points in the two lattices is periodic: the 
configuration at x looks just like the configuration at x + a for any a £ L . 

Now L is a subgroup of Z" of finite index (which can be computed by forming a 
matrix whose columns are n generating vectors for L and taking the absolute value of its 
determinant), and is a subgroup of AZ" with this same index (because the determinant of 
A is ±1). So one can pair off the L-cosets in AZ" with the L-cosets in Z". Any two cosets 
of L are translates of one another, and such a translation gives a bijection between the 
cosets. If we take such a translation from each L-coset in AZ" to the corresponding L- 
coset in Z", we get a bijection ip from AZ" to Z" which is of bounded error; in fact, the 
maximum error is the largest of the norms of the translation vectors used. 

Just like the lattice configuration, the action of the mapping ip looks the same 
near x as near x + a for any a e L . In fact, the displacement tp x - x is a periodic function 
of x. We will refer to such a bijection ip as a 'periodic' bijection, and to the 
corresponding integer approximation (p as a 'periodic' approximation. 

There are only finitely many ways to pair off the L-cosets of the two lattices; and 
for each pair of cosets there are only finitely many translations from one to the other with 
translation vector of norm below a specified bound. (Given any point in the first coset, 
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there are only finitely many points in the second coset within the specified distance of the 
given point; these give the finitely many translation vectors one can try. Clearly there is a 
best translation vector, although it may not be unique. In fact, the pairing between cosets 
can be thought of as a bijection between two finite lattices on the rc-dimensional torus 
R n /L with a suitable metric.) So there are only finitely many 'periodic' bijections meeting 
any specified error bound; it follows that there must be an optimal 'periodic' bijection 
whose maximum error (in the 'periodic' case, a maximum error is attained) is as small as 
possible. It turns out that this is optimal among all bijections: 

Proposition 6.2. For any rational nXn matrix A of determinant ±1, an optimal 
'periodic' integer approximation to A will in fact be optimal among all integer 
approximations. 

Proof. It suffices to show that, if there is any bijection ip from AZ n to TP- 
meeting error bound £ , then there is a 'periodic' bijection meeting error bound s . 

Let m be the index of L in Z n (and in AZ"). Then, for any n-cube of side-length s, 
the number of points of any L-coset in the cube is s" I m + o (s" ) . This means that we can 

find a large cube B and a positive natural number N such that every L-coset contains at 
least iV points inside B and at most N + Nim points within distance e of B (because they 
would lie in a slightly larger cube of side-length s + 2e). 

Now, for any k < m, if we put together k of the m cosets of L in AL n , we get at 
least kN points inside B. These are mapped by ip to at least kN points within distance s 
of B. These image points cannot be included in k - 1 cosets of L in Z n , because 

(k -1)(N + Nim) < kN 

for k < m. So the image points meet at least k cosets of L in 2/ 1 . 

Therefore, by the Marriage Theorem, there is a one-to-one pairing (hence a 
bijection) from the source cosets to the target cosets so that, if C t is paired with Cj, then 
we can find x t e C i such that ifa e C' . Let a t = ^x, -x, ; then ||a.|| < e . Using this coset 
pairing and the translation vectors a,-, construct a 'periodic' bijection ijf\ then ip f meets 
the error bound e , as desired. 
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Propositions 6.1 and 6.2 also work if one is trying to optimize the approximation 
error for the inverse transform, or some combination of the forward and inverse errors 
(e.g., the maximum of the two). Note that the inverse of a 'periodic' approximation is a 
'periodic' approximation to the inverse linear transformation. 

As an example, let us consider the diagonal matrix 




in the simple case a =2. Here the lattice L is just 2ZxZ (i.e., the set of integer 
pairs such that the first coordinate is even), and there are two cosets of L in each of the 
lattices D7/ 1 and Z» Hence, there are only two ways to pair off the cosets; the one which 
gives the smaller error is the one which maps L to (1, 0) + L and (0,1/2) + L to L. This 
yields a bijection V with maximum error 1. The formula for the corresponding bijection 
q> approximating D is: 

if n is even, 
if n is odd. 



f2m + l 
[ nil 
2m 
(n-l)/2 



Note that a greedier algorithm for constructing the bijection might have started by 
mapping all the points in L to themselves (error 0); but then the points in (0, 1/2) + L 
would have to be mapped to (1, 0) + L, leading to a larger overall error of S 12 . Also, 
for this particular example the approximation which is optimal for the forward error also 
happens to be optimal for the inverse error; there is no reason to believe that this happens 
in general. 

For other rational matrices A, there will probably be more cosets to deal with; in 
this case, the implementation of a 'periodic' function will probably be by table lookup. To 
apply the approximating map (p to a given integer vector x, one will determine which 
coset C k of the sublattice A~ l L contains x (which is equivalent to determining which coset 
of L in ATI 1 contains Ax), find in the table a corresponding rational vector a k , and let 
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(px = Ax + a k . Note that, for a general lattice A~ l L, determining which coset contains x 
may not be trivial. It may be more convenient to use a smaller lattice L' c A -1 L of the 
form L' = m,Zxm 2 Zx- -xm n Z; this will make the table longer, but will make it much 
easier to determine which coset contains x. The numbers nij are easily computed: nij is 
the least common denominator of the rational numbers in column; of the matrix A. 

Finding the best 'periodic' approximation is a finite combinatorial search problem. 
There is an algorithm for solving this problem in time polynomial in the number of 
cosets. Determining whether there is a pairing of source cosets with target cosets meeting 
a given error bound (and finding one if there is) is a bipartite matching problem which 
can be solved in polynomial time by network flow methods. The correct optimal bound 
will be one of the n 2 distances between a source coset and a target coset; using a binary 
search, one can find the optimal bound by solving [~21og 2 n~| bipartite matching 

problems. 

Of course, if the number of cosets is too large for the optimal 'periodic' 
approximation to be implemented (let alone found), then one will need to use a different 
approximation algorithm, even if it is suboptimal. 

In order to see how sharp computed upper bounds are, or how close to optimal a 
given bijection might be, it is useful to obtain lower bounds on the possible supremal 
errors of integer approximations or lattice bijections. One way to do this (in fact, by the 
argument of Proposition 6.1, essentially the most general way) is to examine finite parts 
of the two given lattices and show that one cannot even define a partial bijection on these 
finite parts without incurring an error of at least s . 

One finite configuration is easy to use: if the transformed lattice AL n contains a 
point x which is at distance 8 from the nearest point in the standard lattice Z", then any 
bijection from AZ n to Z n must have error at least 8. (The same applies if some point in 
Z" is at distance at least 8 from the nearest point of AZ n .) In particular, if AZ n contains 
points arbitrarily close to the centers of cubes in the standard lattice Z n (this will be true 
if, for instance, some column of A has entries a Xj ,...,a nj such that a lj ,...,a nJ , 1 are 

linearly independent over the rationals), then the supremal error must be at least -Jn 1 2 . 

To obtain better lower bounds, one must analyze the interactions between points 
in the domain lattice - if x * x' and fx = y , then ipx cannot also be y, so it may end up 
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being farther from x'. Such analysis is highly dependent on the particular matrix A. In 
the case of the 2 x 2 diagonal matrix D, one can substantially improve the lower bound: 

Proposition 6.3. If a >0 is given, then, for any integer bisection (p approximating 
the diagonal matrix D, the error sup xez;2 ||Z)x-<px|| must be at least e(a), where: if a > 

1 is irrational, then 

where k = ["(a - 1)1 2~| ; if a > 1 is a rational number m/n in lowest terms, then 

where k is as above; if a = 1, then 6 (a) = 0 ; if a < 1, then £ (a) = £ (a -1 ). 

Proof The case a - 1 is trivial. Any bound which works for a also works for or 1 
(because D(cr l ) is just D{a) with the two coordinates interchanged). So we may assume 
a>\. 

Let £ be the supremal error for (p; we must show that e > e (a). Consider the 
corresponding bijection ip = ipoD' 1 from DZ 2 to DZ 2 , and look at the points of DZ 2 on 
the y-axis. These points are spaced at a distance a" 1 apart, which is too crowded; some of 
them will have to be moved by ip to points not on the y-axis. (This statement may 
appear rather vague. To make it more precise, consider a large finite number 5. The 
number of points of DZ 2 on the y-axis within distance s of the origin is 2[_saJ + l . These 
points are sent by ip to points of Z 2 within distance s + e of the origin; since the number 
of such points on the y-axis is only 2|_s + eJ + l, which is smaller than 2[50:J + 1 for 
large s, ip must map some of these points on the y-axis to points not on the y-axis. The 
statements in the remainder of this proof can be made precise in the same way, but 
actually doing so would make the proof far less readable, so we prefer to state the 
arguments more informally.) 
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In fact, only a fraction 11 a at most of the domain points on the y-axis can be 
mapped to range points on the y-axis. Similarly, for any other vertical line, if) maps at 
most the fraction II a of the domain points on the y-axis to range points on this vertical 
line. 

The number k was chosen so that (2k- l)/a< 1. The map ip sends at most the 
fraction (2k - 1)1 a of domain points on the y-axis to range points with x-coordinate of 
absolute value less than k, because these range points are on 2k - 1 vertical lines. So, for 
the remaining fraction 1 - (2k - 1)1 a of the points on the y-axis, the map rj> introduces an 
error of at least k horizontally. 

If a is irrational, then the vertical distances from points on the y-axis in the 
domain lattice to the nearest points in the range lattice are spread out uniformly over the 
interval [0, 1/2). So, even if we choose the points of least possible vertical error to be 
given horizontal error at least k, the vertical errors will have to range up to at least 
(1 - (2k - l)/a)/2, so the total errors will range up to e (a) . 

If a = m/n in lowest terms, then 1/m of the domain points on the y-axis will entail 
no vertical error (because they are already standard lattice points), 21m of them will entail 
vertical error of 1/m at least, 21m will entail vertical error at least 21m, and so on. If we 
again try to find the minimum possible vertical errors to combine with the large 
horizontal errors, we see that we are forced to use vertical errors up to and including 
[(m - (2k -1) n)l 2 \l m , thus leading to a combined error of e (a) . 

In particular, for a =yf2 no integer approximation can have an error bound better 
than ^22 - 4V2 / 4 = 1 .0 1 06664 1 846 1 9603 ; so the approximation obtained from 
factorization (3.3) is not very far from optimal. And for any a * 1 the error bound must 
be at least 1. 

There is no reason to expect the lower bound from Proposition 6.3 to be sharp in 
most cases; examination of lattice points other than those on the one line considered in 
that proof could show that larger errors must occur. 

The proof of Proposition 6.3 applies to any matrix A having a column whose only 
nonzero entry has absolute value less than 1; a similar argument works to give a lower 
bound on the error when there is a column whose only nonzero entry has absolute value 



greater than 1. This can be generalized to other situations where A maps a rational 
subspace to a rational subspace with the "wrong" scaling. 

ADDITIONAL CONSIDERATIONS 

A number of matrix factorization methods for obtaining integer bijections 
approximating given linear transformations on fixed-length vectors have been considered. 
Such bijections exist and are easy to implement, and can be made to have additional 
desirable properties, such as preservation of suitable integer inputs (which are preserved 
by the given transformation). Approximation methods that are not based on simple 
matrix factorizations were also considered. 

There are many possibilities that remain to be explored, including additional 
factorizations of matrices, different integer approximations of matrix factors, more 
integer approximation methods having nothing to do with factorizations, and improved 
error analysis. 

For instance, as noted earlier, unit triangular matrices can produce larger error 
bounds for the inverse transform than for the forward transform, because the inverse 
transform is computed recursively. One might be able to compute the transform in a 
different way, perhaps doing the recursion in the forward transform for some factors and 
in the inverse transform for other factors, so as to balance out the errors. Or one could try 
to use a different sort of factor matrix which does not have this problem. For instance, 
suppose we partition the coordinates or bands into two groups, and consider two kinds of 
factors: one where linear combinations of first-group coordinates are added to second- 
group coordinates, and one where linear combinations of second-group coordinates are 
added to first-group coordinates. Then recursion would not be needed to invert any of 
these factor matrices, and one may be able to get better overall error bounds. On the 
other hand, degree-of-freedom counting shows that such factorizations would require at 
least four factors in the general fixed-length case, if the length is greater than 2; and more 
detailed analysis shows that even four factors is not enough. It is likely that the 
additional factors will outweigh the benefit from eliminating recursion in the inverse. 

As for the error analysis, even the simple case of a 2 X 2 diagonal matrix was not 
analyzed completely. In more complicated cases the analysis was quite selective; many 
variant factorizations remain to be examined. And everything was based on the initial 
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assumption that the goal was to minimize the worst-case error in the integer 
approximation of the transform (and perhaps the inverse transform). Some applications 
may entail optimizing with respect to some other parameter, in which case different 
integer approximations may work better. 

UNBOUNDED LENGTH VECTORS 

As discussed earlier, the second version of the problem involves analyzing input 
vectors for signals having unbounded length (number of coordinates), but which are of 
bounded amplitude (i.e., the values appearing as coordinates of the vector are bounded). 
Such signals are treated as bounded sequences of real numbers that are essentially infinite 
in both directions. In practice, however, the signals will be of finite length and boundary 
conditions will be needed at the ends of these sequences. 

The use of infinite sequences of real numbers imposes two restrictions. The first 
restriction is a time-invariance condition. Strict time invariance or shift invariance would 
require that shifting the input signal over by one step would result in the same output 
signal also shifted over by one step. This is too strong, though; instead we require that 
the coordinates of the output signal be obtained by applying n time-invariant 
transformations in rotation (so shifting the input n steps results in the same output shifted 
by n steps). This can also be expressed as applying n time-invariant mappings or 
"filters," taking only every zi'th coordinate of each output signal ("downsampling" ), and 
merging the results. 

In such a case the output signal consists of n different subsignals or "bands" 
merged together. One can also treat the input signal as comprising n bands in the same 
way. So the input signal is conceptually broken up into blocks of length n; the/th band 
consists of the j'th component of each block. (Sometimes the input signal is presented in 
n separate bands already.) So the input and output signals can be thought of as essentially 
infinite sequences of members of R" , and the linear transformation as a fully time- 
invariant mapping in this formulation. 

The second restriction is that a component of the output signal depends on only 
finitely many components of the input signal; a transformation with this property is called 
FIR (finite impulse response). A time-invariant (or n-fold time-invariant as above) FIR 
linear transformation must produce a bounded-amplitude output signal when applied to a 
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bounded-amplitude input signal. The part of the input signal on which a given output 
coordinate depends (the "stencil" of the transformation) will often include more than n 
coordinates. 

A linear transformation with these properties can be described by nxn matrices 
5 M k for k g Z , only finitely many of which are nonzero. The input signal x is a sequence 
of n- vectors x, , and the output signal y = /(x) is a sequence of n-vectors y, ; these are 
related by the formula 

k i 

(the sums are over all integers, but only finitely many terms are nonzero). This 
O 10 can be more conveniently expressed in terms of the z-transform, which we think of here 
?S as a generating function approach. If we introduce the generating functions p(z) = 

jjj' ^ x ,z' and q(z) = J J J z ' for the in P ut and out P ut si S nals (these can be thought of as 

kQ series of n-vectors or as n-vectors of series), and we also define the matrix A(z) to be 

*0 J ^ M k z ~ k , then the formula above becomes simply q{z) = A (z ) p(z) ■ The z-transform 

^ 15 matrix A(z) (commonly called the polyphase matrix of the transformation) is a matrix 
\ '% whose entries are Laurent polynomials over R, i.e., members of the ring R[z, z _i ] . If no 

negative powers of z occur in the matrix, then the output vector at timej depends only on 
^ the input vectors at time j and earlier times (the transformation is causal). Just as for 

fixed-length transformations, composition of transformations here corresponds to 
20 multiplication of the associated z-transform matrices. 

We will assume that the given linear transformation is invertible and that the 
inverse transformation is also FTR (it is automatically linear and time-invariant). In this 
case, the original transformation is said to admit perfect reconstruction. So the inverse 
transformation is also given by a z-transform matrix B(z), and if p(z) and q(z) are as 
25 above, then we have p(z) = B(z)q(z) = B(z)A(z)p(z). Since this holds for all input signals, 
B(z) must be the inverse matrix to A(z). 

We will require our integer approximation maps to be FIR and time-invariant (on 
n-vectors), but not necessarily linear. And we impose the same restrictions on the inverse 
maps. 

30 In order to measure the error of an integer approximation, we need a norm on the 

space of signals; the Euclidean norm does not apply to infinite-length signals. Since we 
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are working with bounded-amplitude signals, we could simply take the supremum of the 
absolute values of the components of the signal. But since we are thinking of the signal y 
as a sequence of vectors y, e R", it is natural to define the norm ||y|| to be supjyj. 

Then the error of an integer approximation (p to a given linear transformation A is just the 
5 supremum of |Ax-<px| over all input signals x. (We will abuse notation slightly by 

using A for a transformation and A(z) for its z-transform matrix.) 

INVERTIBLE INTEGER-TO-INTEGER SIGNAL MAPPINGS 
As we discussed earlier, in the fixed-length case, a necessary condition for the 
existence of a bounded-error integer approximation cp to the linear transformation A is 



K 10 that det A = +1 . We may as well assume that the determinant is 1, because, if it is -1, we 

Q can negate a row of the matrix to change the determinant to + 1 . 

P 

S In the unbounded-length case, the linear transformation is given by a matrix A(z) 

tp over rT^"" 1 ! . We are assuming that the inverse transformation is also given by such a 

e L 

matrix, which must be the inverse of A(z), so detA must be an invertible element of the 

L 15 ring R[z, z -1 ] , i.e., a nonzero monomial cz k . 

2j If we look at an integer input signal that is constant on each band, then the output 

W signal will also be constant on each band; this essentially reduces to the case of vectors of 



fixed length n. The constant matrix for this fixed-length transformation is just A(l). 
Since an integer approximation for general signals certainly gives an integer 
20 approximation for these particular signals, the matrix A(l) must satisfy the necessary 
condition above, det A (l) = ±l. So the monomial det A(z) must be ±z k for some integer 

k. Again we can pull this factor out of one of the bands to reduce to the case of a 
transformation of determinant 1; an integer approximation for the modified matrix easily 
yields one for the original matrix (just shift and/or negate one band at the end). 

25 As described earlier, the main approach will be to factor a given z-transform 

matrix into matrices of special form, mainly 'permutation' matrices (ordinary permutation 
matrices with some entries negated) and elementary matrices. The 'permutation* matrices 
are easy to handle, because they already map integer signals to integer signals (they just 
rearrange and possibly negate the bands). 

30 An elementary matrix factor (differing from the identity only at a single off- 

diagonal entry) corresponds to a transformation which adds a multiple of one band (or, if 
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the off-diagonal entry has several terms, multiples of shifted copies of one band) to 
another band. Factorizations into such matrices have been considered by a number of 
those skilled in art, such factors, at least in the 2 x 2 case, are also known as liftings. 

If a transformation is given by an elementary matrix which adds some 
5 modification (combination of shifts and constant multiples) of band i to bandy, then we 
get an integer-to-integer approximation to the transformation by simply rounding the 
modification of band i to an integer before adding it to band j. This is easily invertible: 
simply subtract the same rounded modification of band i from band j. This applies more 
generally to matrices given by unit triangular matrices (lower or upper triangular matrices 
10 whose diagonal entries are all 1). 

A number of the calculations presented earlier can be applied without change in 
p the present context, given suitable definitions. In particular, we define the norm ||A|| of a 

J signal transformation A (or the norm \\A(z)\\ of its associated z-transform matrix) to be 

*P the supremum of |A,x|/||x| over all nonzero bounded inputs x (where is defined as in 

-7 15 the preceding section). Then, if A = A i A 2 ■■■A k where each A t can be approximated by an 
/ integer mapping tp t with error bound C l , then A can be approximated by the composition 

- of these integer mappings with error bound 

a < 91 > c 1+ SA||c J +[A»lhIc,+-+lAllKi-|K-Ic«. 

In fact, this bound can be slightly improved to: 

20 (9.2) c 1 +|a|c 2 +1aa||c3+-+|aa-A-i||q- 

Also, if (p approximates A, then qr l approximates A" 1 , because if x = qrty, then 
(9.3) J A||" 1 J Ax - px|| < J A- ] y - q>^y\ ^ |K || II Ax " HI • 

In this section, we will concentrate on one-dimensional signals, but the methods 
are also applicable to multidimensional signal transformations (i.e., to matrices whose 
25 entries are Laurent polynomials in several variables rather than the single variable z). In 
particular, elementary matrices are approximable by integer bijections as above even in 
the multidimensional case. The main difference is that it is more difficult if not 
impossible to factor a given multidimensional matrix of determinant 1 into elementary 
matrices. 

30 
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FACTORING A z-TRANSFORM MATRIX 
The Gaussian elimination method for factoring a matrix over R into elementary 
matrices and a diagonal matrix (and maybe a permutation matrix as well) can be extended 
to the case of matrices over R[z,z _1 ] . This is the Laurent polynomial version of the 

algorithm for reducing a matrix polynomial to Smith normal form. The Smith normal 
form and a variety of methods for reducing a matrix polynomial to Smith normal form are 
known by those of ordinary skill in this field, where such methods involve, for instance, a 
Laurent polynomial case for 2 x 2 matrics and a case for n x n matrices. 

Here we are concerned with the perfect reconstruction case, so we assume that the 
determinant of the given matrix A(z) is a nonzero monomial. In fact, by pulling out a 
diagonal matrix factor to begin with, we can reduce to the case where det A(z) = 1 . The 
entries in such a diagonal matrix represent scaling (the numerical coefficients) and delays 
or advances (the powers of z) for the corresponding bands. 

The main part of the algorithm uses elementary row operations (each of which 
corresponds to pulling out an elementary matrix factor on the left). Start by selecting a 
column to reduce (say, the first column). If this column has more than one nonzero entry, 
then choose two nonzero entries, say a(z) and b(z). Suppose that the 'degree' of a(z) is at 
least as large as the 'degree' of b(z). (Here we define the 'degree* of a Laurent polynomial 
to be the degree of the highest-degree term minus the degree of the lowest-degree term; 
the 'degree' of 0 is -°° .) Then we can perform an elementary row operation which 
subtracts a suitable multiple of b(z) from a(z) so that the difference has lower 'degree' 
than a(z). (One can actually choose the multiple so that the difference has lower 'degree' 
than b(z). However, it will be useful later to not require this, even if it means that more 
reduction steps are needed.) Repeat this process until all but one of the entries in the 
selected column are 0. Since det A(z) has 'degree' 0, the remaining nonzero entry must 

be a nonzero monomial. 

Now select a second column, and do the same reduction to all of the entries in this 
column except the one in the row containing the nonzero entry from the first column (this 
row is excluded for now). So only one such entry will be nonzero, and again this entry 
must be a nonzero monomial. This means that, with one more row operation, we can 
zero out the entry in the excluded row in the second column. Do the same thing in a third 
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column (now there are two excluded rows), and so on until all columns are processed. 
What remains will be a permuted diagonal matrix, with the nonzero entries being 
monomials with product 1. 

After pulling out a permutation matrix (or a 'permutation' matrix of 
determinant 1), we are left with a diagonal matrix of determinant 1. This can be written 
as a product of diagonal matrices each of which has only two non-1 diagonal entries, 
which are reciprocals of each other. Then, if desired, one can write each of these 
essentially 2x2 diagonal matrices as a product of elementary matrices using the formulas 
discussed previously. In fact, if one really wants to, one can even write the 'permutation' 
matrix of determinant 1 as such a product as well, because such a 'permutation' can be 
written as a product of simple 'permutations' each of which just swaps two rows and 
negates one of them, and such a simple 'permutation' can be written as a product of three 
elementary matrices: 



If one factors all the way down to elementary matrices in this way (leaving the 
'permutation' matrix unfactored), then a great many factors might be required. But it 
turns out that unit triangular matrices are as good for our purposes as elementary matrices 
(simple rounding works as an integer approximation method, as discussed previously); 

using these, one can get by with far fewer factors, because one permuted unit triangular 
matrix can replace a large number (up to n(n -l)/2 of elementary matrices. To see this, 
suppose we are in the process of reducing a column. Say p 1 (z) is the nonzero entry in 

this column of lowest 'degree'; use elementary row operations to subtract multiples of 
Pi(z) from the other nonzero entries to get their 'degrees' below that of p x {z). Let 
p 2 (z) be the newly modified entry of least 'degree,' and subtract multiples of p 2 (z) 
from the other nonzero entries (excluding p l (z)) to reduce their 'degrees' below that of 
p 2 (z). Now choose p 3 (z) and reduce the other nonzero entries, excluding p l (z) and 
p 2 (z); and so on. All of the reduction steps described here can be combined into a 
single permuted unit triangular matrix. 



(10.1) 
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However, there is no fixed bound (depending on n alone) for the number of 
factors needed here, even if these more general factors are allowed; if the entries of the 
matrix have very high 'degree,' then many factors might be required. 

If one is interested in factoring a causal linear transformation (one where no 
negative powers of z occur in the corresponding matrix) into causal elementary factors, 
one can do so by following the same procedure as above, using ordinary polynomial 
degrees instead of 'degrees'. This is just the ordinary reduction of a polynomial matrix to 
Smith normal form. In this case, if the determinant of the matrix has one or more 
factors z, one may not be able to remove them at the beginning; instead one follows the 
Smith normal form process (which is slightly more involved in this case) and ends up 
with a diagonal matrix in the middle of the factorization. If this diagonal matrix has 
determinant ±z k , then one can express it as a constant diagonal matrix of determinant 1 
(which can be factored into elementary matrices, as discussed earlier) and a diagonal 
matrix with entries of the form ±z J (which must be handled some other way). 

FACTORS WHICH PRESERVE CONSTANT SIGNALS 

As described earlier, one can consider the case where the given linear 
transformation already sends certain integer-valued inputs to integer-valued outputs, and 
we want the integer-to-integer approximating map to give the same results for these 
inputs. 

In particular, let us consider the constant input signal with value k on all 
coordinates. Most filter banks are set up with one low-pass filter and one or more higher- 
pass filters. The higher-pass filters should have zero response to a constant signal, while 
the low-pass filter should give a constant nonzero response (preferably the same constant 
as the input). If the low-pass filter appears first in the bank, then the above properties can 
be expressed in terms of the z-transform matrix M(z) for the filter bank by the equation 
M (1)1 = e p where 1 is the vector in R" with all coordinates 1 and ej is the vector with 

first coordinate 1 and remaining coordinates 0. 

We also consider the closely related family of matrices A(z) such that A^e^e^ 

Such a matrix, when applied to an input consisting of a constant signal on band 1 and 
zero on all other bands, returns that input unchanged. (Such a matrix would commonly 
occur for a processing step applied to a signal after it had already been separated into 
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low-frequency and high-frequency bands.) One can convert from a matrix M(z) satisfying 
M (1)1 = e { to the form A(z) by pulling out a constant matrix factor A which sends 1 to e { 
as described earlier: if M (z) = A(z) A , then M (l) = A(l)A , so M (l)l = e t if and only if 
A(l)e,=e 1 . 

The condition A(l)e 1 = e! is equivalent to the statement that the leftmost column 
of A(l) is e t . This means that, in the matrix A(z), all entries in the first column are 
divisible by z - 1 except for the first entry, which has remainder 1 when divided by z - 1 . 

Let Q be the set of all matrices A(z) with entries from R[z,z _1 ] which have 
determinant 1 and satisfy the equation A(l)e, =e,. It is easy to see that Q is a group. 
The set of matrices M(z) of determinant 1 which satisfy M(l)l = ej is the right coset QA 
of £. 

If we have an elementary matrix in Q , then its standard integer approximation 
also leaves a constant integer signal in band 1 (with zeros elsewhere) unchanged. So any 
matrix which can be factored into elementary matrices in Q has an integer approximation 
which preserves constant signals in band 1. 

Theorem 11.1. Any matrix in the group Q can be factored into a product of 
elementary matrices in Q . 

Proof. We perform the same reduction using elementary row operations as in the 
previous section, but with an extra restriction on the operations. When we have two 
nonzero entries a(z) and b(z) in the column we are currently working on, we wish to 
perform an elementary row operation which either subtracts a multiple of a(z) from 
b(z) so as to reduce its 'degree,' or subtracts a multiple of b(z) from a(z) so as to 
reduce its 'degree.' For an elementary row operation to correspond to a matrix in Q , it 
must meet the following restriction: if it subtracts a multiple of row 1 from another row, 
then the multiplier must be divisible by z - 1 . 

If neither a(z) nor b(z) is in row 1, then the restriction does not apply, and the 
usual reduction step is allowed. Now say a(z) is in row 1. If the 'degree' of a(z) is 
greater than or equal to that of b(z), then we can subtract a suitable multiple of b(z) 
from a(z); again this is always allowed. If the 'degree' of a(z) is less than that of 
b(z), then we want to subtract a multiple of a(z) from b(z) so as to eliminate at least 
one leading or trailing coefficient from b(z). (We are not requiring that the 'degree' of 
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b(z) be reduced all the way below that of a(z) ; reducing it by at least one will suffice.) 

So in fact we could make the multiplier a monomial cz k chosen so that the leading term of 
cz k a(z) is the same as that of b(z). But the multiplier cz k -cz k ~ x would also work to 
eliminate the leading term of b (z) , and it would not introduce new trailing terms because 
the 'degree' of cz k " 1 (z-l)a(z) is one more than that of a(z), and hence not more than 
that of b(z). So this multiplier will give a valid reduction step, while satisfying the Q 

restriction. 

Let us require that column 1 be the first column reduced in this way. After 
column 1 is reduced, the remaining nonzero entry in this column must be in row 1, 
because the matrix will still be in Q . Then one can proceed to reduce the other columns 
as described in the previou section; these row operations do not involve row 1, so they are 
all allowed. And the steps for eliminating the remaining entries in excluded rows never 
require subtracting a multiple of row 1 from another row (since row 1 was the first row 
excluded), so they are allowed as well. 

So we can reduce to a permuted diagonal matrix. Since the upper left entry of the 
matrix is still nonzero, the permutation does not move index 1. So one can perform a 
sequence of swap-and-negate operations not involving row 1 so as to reduce to an actual 
diagonal matrix; these operations can be expressed as elementary operations using (10.1), 
and these operations are allowed because they do not involve row 1. 

We are now left with a diagonal matrix of determinant 1 whose entries are 
monomials in z; and the monomial in the upper left corner must have coefficient 1 in 
order for the matrix to be in Q . This matrix can be factored into essentially 2x2 
diagonal factors of the form considered earlier, where each diagonal entry is a monomial 
in z: one between rows 1 and 2, one between rows 2 and 3, and so on. Each of these 
factors can be broken down into elementary matrices using the formula 

(o or'Ho l\[a-l l](o 1 )(-(«"!)« l) 

(this is (2. 1) with r = 1 and s = a - 1 ). For the first factor, a is of the form z k , so 
a -I and (a-l)a are divisible by z-1; thus, the elementary matrices here are in Q. 

For the remaining factors the restriction does not apply. This completes the factorization 
into elementary matrices in Q . 
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As in the preceding section, one can get by with far fewer factors by using unit 
triangular matrices rather than elementary matrices in the n x n case. 

Again, if one has a causal transformation of determinant 1 and wants causal 
elementary factors, one can get them by the same procedure, using ordinary polynomial 
degrees instead of 'degrees' and always trying to eliminate leading coefficients rather than 
"leading or trailing" coefficients. (The entries in the final diagonal matrix will be 
constants.) So any causal matrix in Q can be factored into causal elementary matrices 
in g. 

For a causal transformation whose determinant has z factors, one can first check 
whether the first column of the matrix is divisible by a power of z; if so, this power can be 
pulled out as a diagonal matrix on the right (which just shifts band 1, and hence preserves 
a constant signal in this band). Once this is done, the first column can be reduced as 
usual, and then the Smith normal form process can be applied to the lower right 
(n-l)x(n-l) submatrix. Then a diagonal matrix can be pulled out on the left (in two 

parts, as at the end of the preceding section), and the reduction of row 1 using the 
remaining rows (which now look like the identity matrix) can be completed. 

In the proof of Theorem 11.1, the first row and column of the matrix must be 
handled specially, but there is no restriction on the remaining rows and columns; they can 
be reduced by any Euclidean algorithm steps desired. This extra freedom can be used to 
obtain additional properties of the factorization, if desired. 

For instance, suppose k < n is fixed, and we are given annxn matrix A(z) over 
R[z, z -1 ] of determinant 1 with the property that A(l)e 1 = e 1 for all i<k, where e t is a 

vector with 1 in entry i and 0 elsewhere. In other words, the transformation given by A(z) 
preserves a single-band constant signal in any of the first k bands. Then A(z) can be 
factored into elementary matrices which also preserve constant signals in these bands. To 
see this, first perform the reduction on the first column as in the proof of Theorem 11.1, 
where now the first Arrows are restricted (any elementary operation with one of these 
rows as the source must use a multiplier divisible by z - 1). We can continue this until no 
more legal reductions are possible; at this point all of the unrestricted rows will have been 
zeroed out. Since the determinant of the matrix is 1, the remaining nonzero entries in the 
column must have greatest common divisor 1, so we can obtain the number 1 as a sum of 
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multiples (by elements of R[z, z -1 ]) of these entries. Using the same multipliers with an 
additional factor of z - 1, we can perform legal elementary operations so as to make the 
entry in row n of this column equal to z - 1. Now, since the first entry in this column is 1 
plus a multiple of z - 1 (this was true at the start, and all operations preserved it), we can 
perform one more elementary operation from row n to row 1 to change this first entry 
to 1. Now legal elementary operations from row 1 can be used to zero out all of the other 
entries (which are multiples of z - 1). Next we proceed to the second column and do the 
same thing in rows 2 through n\ then we can easily eliminate entry 1 in this row using 
the 1 in entry 2. Proceed this way through the first k columns, and then use the 
unrestricted algorithm to handle the rest. 

SMALL-STENCIL FACTORIZATIONS 
When we factor a z-transform matrix into elementary matrices, we are 
decomposing the corresponding transformation into steps which allow only a very 
specific form of interaction between parts of the signal. However, this form of interaction 
can still be very wide-ranging, because arbitrary powers ofz are allowed in the 
multipliers occurring in the elementary factors. One may want to restrict the factors 
further so as to require the interactions to be fairly local. 

Let us consider the case of 2 x 2 matrices first. This is the case where the signal 
(long sequence of numerical values) is broken up into two-entry blocks. An elementary 
matrix factor with nonzero entry in the upper right corner will modify this signal by 
leaving the second entries in the blocks alone, but adding some linear combination of the 
second entries to the first entries. If the nonzero entry is in the lower left corner, then a 
linear combination of the first entries will be added to the second entries. 

A natural locality restriction would be to require that the number added to a 
second entry be computed from the two neighboring first entries (the one in the same 
block and the one in the next block), and the number added to a first entry be computed 
from the two neighboring second entries (the one in the same block and the one in the 
previous block). This means that we allow only elementary matrix factors of the forms 

f 1 ""I and ! °X 

[0 1 J (rz l +s lj 

where r and s are constants. 
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For n x n matrices where n > 2 , it is less obvious what the exact meaning of 
"small-stencil" or "local" elementary matrix should be. One could allow only nearest- 
neighbor interactions as in the 2 x 2 case, but this would be extremely restrictive; it would 
allow only elementary matrices where the nonzero off-diagonal entry is a constant 
adjacent to the diagonal, a monomial rz in the upper right corner, or a monomial rz~ x in 
the lower left corner. It would be more flexible to allow interactions between the fth 
entry in a block and the two closest ;"th entries, one on each side. This would allow the 
nonzero off-diagonal entry of the elementary matrix to occur anywhere, but: if it is above 
the diagonal, it must be of the form rz + s ; if it is below the diagonal, it must be of the 
form rz' 1 +s. (Or one may want a different definition here if one is trying to meet 
particular implementation restrictions.) 

It turns out that, even with the restrictive nearest-neighbor definition, it is always 
possible to factor a z-transform matrix of determinant 1 into small-stencil elementary 
factors. Since we already know how to factor such a matrix into unrestricted elementary 
factors, we just need to express a given elementary matrix as a product of small-stencil 
elementary matrices. Next note that, because of equations such as 

(i2 -» (J $ ?} 

we may assume that the nonzero off-diagonal entry in the given elementary matrix 
is a monomial. 

In terms of the unblocked signal, this transformation adds c times entry number i 
+ knto entry number j + kn for all integers Jfc, where c, i, and; are given constants (and; - 
i is not divisible byn). If this is not already a nearest-neighbor interaction (i.e., 
|;-i|>l), then it can be changed into one by using nearest-neighbor swaps to move the 
interacting entries closer to each other. For instance, if j-i >1 , then we can swap entry 
j-l + kn with entry ;' + kn for all integers k. This will not move the entries in positions i 
+ kn unless j-l-i is divisible by n, in which case these entries are moved one place to 
the right. So the interacting entries will end up one or two places closer to each other. 
Repeat this until the interacting entries are adjacent, do the operation which performs the 
interaction, and then reverse all the swaps. This factors the nonlocal operation into a 
sequence of local operations (including swaps). 
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We do not want to use literal swaps, though, since they have determinant -1 as 
linear operations. Instead, we negate one of the two entries being swapped; this 'swap' or 
swap-and-negate is a 90-degree rotation between two bands. 

Returning to the z-transform matrices, this states that we can factor our non-local 
monomial elementary matrix into a local monomial elementary matrix and a number of 
local 'swaps.' A local 'swap' which does not cross block boundaries looks like an identity 
matrix except that some 2x2 block centered on the diagonal is changed to 



0 -1 

1 0 



If the 'swap' does cross a block boundary, then it is an identity matrix with the 
four corner entries changed to 











or 











Here is an example with n = 3: 
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It now remains to note that each local 'swap' can be factored into three local 
elementary matrices. For the case where the 'swap' does not cross a block boundary we 
just use (10.1). If the 'swap' does cross a block boundary we use a very similar formula: 



0 



In- summary, we have: 

Proposition 12.1. Any matrix over R.[z, of determinant 1 can be factored 

into small-stencil elementary matrices. 

This holds under either definition of "small-stencil." Note that a large number of 
factors may be required. 

If a given z-transform matrix has determinant a monomial other than 1, then of 
course it cannot be factored into small-stencil elementary factors, because it cannot be 
factored into elementary factors at all. But if we allow a simple one-step shift and/or 
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negation in one band (i.e., the identity matrix with one diagonal entry changed to ±z ±1 or 
-1) to be considered "small-stencil," then a factorization into small-stencil factors can be 
achieved. To see this, recall from previous sections that one can factor the given matrix 
into elementary matrices and diagonal matrices with diagonal entries of the form ±z k ; the 
elementary parts are handled as above, and the diagonal parts are taken care of by these 
new factors. Similar remarks apply in the next two sections. 
SIMULTANEOUS SMALL-STENCIL AND CONSTANT-PRESERVING FACTORS 

In the preceding two sections we considered two extra properties that can be 
achieved in a factorization of a suitable z-transform matrix. Is it possible to achieve both 
of these properties at the same time? 

First consider the more flexible definition of "small-stencil" from the previous 
section; we will see that suitable factorizations do exist in this case. Suppose we are 
given a matrix in the group Q . We can factor the given matrix into elementary matrices 
in Q using the methods discussed in the section entitled "Factors Which Preserve 
Constant Signals"; some of these have the nonzero off-diagonal entry in the first column, 
and others do not. For the ones which do not, the off -diagonal entry is unrestricted; we 
may assume that the off-diagonal entry is a monomial because of (12.1). This matrix can 
now be factored into a local elementary matrix and some local 'swaps' using the method 
described in the previous section. 

For the elementary matrices with off-diagonal entry in the first column, we are not 
allowed to reduce to the case of monomials; instead, using (12.1), we can reduce to the 
case where the off-diagonal entry has the form c{z k - z* -1 ) for some real constant c and 

some integer k. This means that c times an entry in the source band is added to an entry 
in the destination band and subtracted from the next entry in the destination band. By 
performing a suitable sequence of 'swaps,' one can arrange for each source entry to lie in 
between the two destination entries it will affect. Then the desired operation will be 
small-stencil under the flexible definition. Afterwards the 'swaps' can be reversed to 
restore the bands to their original positions. 

The factors here are not in the group Q ; instead of leaving the constant signal 
alone on the first band, they move it around from band to band. But the elementary 
operations specified above do leave the constant signal unchanged on whatever band it is 
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currently in when the operations are applied. As for the 'swaps,' when one of these is 
factored into three elementary steps, the constant signal may appear on two bands 
simultaneously, but it will be restored to a single band (although perhaps negated) by the 
time the 'swap' is complete. So the corresponding integer approximation maps will 
always leave this integer constant signal unaltered (somewhere), and when all of the 
factors have been performed the constant signal will end up where it started, unchanged 
by the integer approximation map. 

Now suppose we want to use the restrictive nearest-neighbor definition of "small- 
stencil." Here we assume n>3, because the case n = 2 is already handled above. The 
same procedure described above works here, except that the elementary operation adding 
the band containing the constant signal to another band (multiplied by c in one direction 
and by -c in the other direction) is no longer allowed and must be decomposed further. 

Suppose that the band currently containing the constant signal is band i, and we 
want to add it to band j : for each entry x in band i, we are to add cx to the nearest entry to 
the right in band j and subtract cx from the nearest entry to the left in band j. Let / be a 

band adjacent to j which is not band i. Now perform the following procedure: 

subtract c times band / from band j; 

move band i up to band f—1', 

add band j-\ to band /; 

move band /-I down to band / + 

subtract band j+1 from band /; 

move band / + 1 up to band i; 

add c times band j to band j; 

move band i up to band / — 1 ; 

subtract band f-1 from band /; 

move band f - 1 down to band / + 1 ; 

add band f+1 to band /; 

move band / + 1 up to band L 
Each "add" or "subtract" here is a nearest-neighbor elementary operation. "Move 
band i up to band / — 1 " means that, if band i is not already immediately below band j 
(if it is, do nothing), then 'swap' band i with band i + 1 (the band moving from i + 1 to i is 



the one which is negated), then 'swap' band i + 1 with band r + 2, and so on, wrapping 
around from n to 1 if necessary, until the band being moved reaches / - 1 (or n if / = 1 ). 
The other "move" steps are interpreted similarly. Each of these 'swaps' is factored into 
three nearest-neighbor elementary operations as usual. 
5 One can check that the net effect of this procedure is as desired: for each entry x 

in band i, the procedure adds cx to the nearest entry to the right in bandy and subtracts cx 
from the nearest entry to the left in band;'. When it is applied to input containing a 
constant signal in band i and nothing elsewhere, the "subtract c times" operation has no 
effect, the next five steps add the constant signal to band / and then subtract it from 
10 band / for no net effect, the "add c times" step does nothing because there is nothing 

fr* currently in band / , and the last five steps again subtract and add the same signal from 

O band f for no net effect. So this procedure preserves the constant signal in band i. 

Thus, even under the strictest definition of "small-stencil," one can find a 

*S factorization of a given matrix in Q into elementary factors so that the resulting integer 

03 15 approximation map (p preserves a constant signal in band 1 . 

j\ But suppose one does not want the constant signal to roam from one band to 

tU another in this way. Is it still possible to achieve a small-stencil and constant-preserving 

~~ factorization? In other words, can every matrix in the group Q be factored into small- 

stencil elementary factors which are also in the group Q ? 
20 The answer to this also turns out to be yes, if the flexible definition of "small- 

stencil" is used. Let us first consider the 2 x 2 case. We can factor the given matrix into 
elementary matrices in Q as before. Again as before, we can reduce to the case where 
the nonzero off-diagonal entry of an elementary matrix is a monomial if it is not in 
column 1 , and is of the form c (z k - z k ~ x ) if it is in column 1 . 
25 If an elementary matrix has as its nonzero entry a monomial at the upper right, we 

can handle it using the factorization 

1 cz 2k+i \ (z 0 Y (l cz l \ (z 0 



[0 1 J [0 z~ l j [0 1 J [0 
where k is an integer and i is 0 or 1. The elementary matrix appearing on the right 
here is small-stencil, and the z-shift diagonal matrix has the following small-stencil 
30 factorization in Q : 
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l oWi z-'-i 
0 z~ l ] I 1 1 0 1 



- X -z 1 
2 



f 1 2-2 Z -'"| 








1° 1 J 









The other elementary matrix, with a binomial at the lower left, is handled by the 
factorization 

[c(z^-z^) l) = (o (rf(l-z*) l) (o 

The elementary matrix on the right here is small-stencil if i = 0. If z = 1, we need 
to factor it further: 

1 OWl f 1 0 s ] 



| [c(z-l) lj (0 1 J [of'-c lj 

S (\ -kc- l z\ (I OYl 



[0 1 J [2C-2CZ- 1 1 |o 1 

This completes the factorization into small-stencil elementary matrices in Q . 

For the n x n case, first factor the matrix into elementary matrices in Q as in the 
previous section entitled "Factors Which Preserve Constant Signals." Each of these 
elementary matrices only affects two of the n bands, so it can be factored into local 
elementary matrices by the methods for the 2 x 2 case above; under the more flexible 
definition, these factors are small-stencil. 

If the strict nearest-neighbor definition of "small-stencil" is used, then there are n 
x n matrices in Q for n > 3 which cannot be factored into small-stencil elementary 
factors in Q . In fact, a small-stencil elementary factor in Q cannot have its nonzero off- 
diagonal entry in the first column, so only matrices with leftmost column e^^ can be 
products of such matrices. 

So we have: 

Proposition 13.1. Under either definition of "small-stencil, " any nxn matrix A 
in Q can be factored into small-stencil elementary matrices so that the corresponding 
integer approximation preserves constant signals in band 1. Furthermore, any matrix 
in Q can be factored into small-stencil elementary matrices in Q under the flexible 
definition of "small-stencil, " but ( if n > 3 ) not under the strict definition. 
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The results in this section and the preceding one seem to indicate that requiring 
factors to be small-stencil substantially increases the size of the factorization. However, 
this is normally true only when one is factoring unusual matrices with long-scale but no 
short-scale interactions. For more typical matrices consisting of Laurent polynomials 
with no gaps in their coefficients, it is common to obtain the small-stencil property with 
no additional effort during the factorization process, or with only a small amount of care 
when one has a choice to make. An example of obtaining the small-stencil property with 
no added effort is shown in the upcoming section entitled "Example: The 9-7 Wavelet." 
CAUSALITY AND SMALL-STENCIL FACTORIZATIONS 

We noted earlier that the the algorithms in the sections entitled "Factoring A z- 
Transform Matrix" and "Factors Which Preserve Constant Signals" need only slight 
modification so that, when applied to a causal matrix (one where no negative powers of z 
occur), they yield causal factor matrices. However, the algorithms in the sections entitled 
"Small-Stencil Factorizations" and "Simultaneous Small-Stencil And Constant-Preserving 
Factors" involve moving bands back and forth, thus introducing non-causal factors even 
when the original matrix is causal. If one wants a factorization into small-stencil 
elementary matrices which are also causal, then one will need modified methods, at least. 

For an elementary matrix to be both causal and small-stencil, its nonzero off- 
diagonal entry must be of the form rz + s. If the flexible definition of "small-stencil" is 
used, then the z-coefficient r is allowed to be nonzero only for entries above the diagonal. 
The strict definition of small-stencil imposes stronger restrictions: the off-diagonal entry 
must be a constant adjacent to the diagonal or a monomial rz in the upper right corner (in 
the 2 x 2 case, a binomial rz + s is allowed in the upper right corner). 

It turns out that, in the 2 x 2 case, causal small-stencil factorizations cannot 
always be attained: 

Proposition 14.1. There exists a 2 x2 matrix over R[z] of determinant 1 which 
cannot be expressed as a product of causal small-stencil elementary matrices. 

Proof. Suppose a given non-constant 2x2 matrix A can be written as a product 
of causal small-stencil elementary matrices. A factor with an entry rz + s can be split up 
into a factor with entry rz and a factor with entry s. So A can be written as a product of 
constant matrices of determinant 1 and elementary matrices with upper right entry of the 
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form rz. Express A as such a product with a minimal number of factors. (In this product, 
the two factor types must alternate, because two adjacent factors of the same type could 
be combined into one. Note that at least one rz factor must occur.) 

The last factor in this product has at least one nonzero entry in its bottom row; 
select a column (column 1 or column 2) whose bottom entry in that last factor is nonzero. 
(If the last factor is an rz factor, column 2 will be selected.) Now multiply out this 
product of matrices from right to left. We will show by induction that, at each stage of 
this process (starting after the first rz matrix has been multiplied in), the two entries in the 
selected column of the partial product will have degrees differing by at most 1; in fact, if 
the last matrix multiplied in was an rz matrix, then the upper entry in the selected column 
will have degree 1 more than the degree of the lower entry. 

The partial product just before the first rz matrix is multiplied in is constant, and 
its selected column has nonzero lower entry. Hence, after the rz matrix is multiplied in, 
the upper entry in the selected column will have degree 1 and the lower entry will have 
degree 0. 

Suppose that (after multiplying by an rz matrix) the selected column in the current 
product has upper entry of degreed and lower entry of degree d— 1. Then, after 
multiplying by a constant matrix of nonzero determinant, one of the two entries will have 
degree d and the other will have degree d - 1 or d. The only way in which the lower 
entry will still have degree d -1 is if the lower left entry of the constant matrix is 0. 

Now suppose we have just multiplied in a constant matrix of determinant 1, and 
are about to multiply in an rz matrix (not the first), and the selected column has entries of 
degrees differing by at most 1. Say the larger of the two degrees isd. The constant 
matrix just multiplied in cannot have lower left entry 0, because if it did we would have 
three consecutive factors of the form 




and these could be replaced by the two factors 
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contradicting the minimality of the factorization. So the lower entry of the 
selected column currently has degreed, while the upper entry has degree d-l ovd. 
After multiplying by the new rz matrix, the degree of the upper entry of the selected 
column will be d + 1 and the degree of the lower entry will be d. This completes the 
induction. 

In particular, the selected column of the final product A will have entries with 
degrees differing by at most 1. Now, the matrix 

CO 

is a 2 x 2 matrix of determinant 1 which has no column whose entries have 
degrees differing by at most 1. Therefore, this matrix cannot be the matrix A above; in 
other words, this matrix cannot be factored into causal small-stencil elementary factors. 

On the other hand, the presence of a third band yields enough extra freedom to 

allow causal small-stencil factorizations: 

Proposition 14.2. If n > 2, then every nxn matrix over R[z] of determinant 1 

can be expressed as a product of causal small-stencil elementary matrices. 
(This is true even under the strictest definition of "small-stencil.") 
Proof. We already know that such a matrix can be written as a product of causal 
elementary matrices. By (12.1), these elementary matrices can be factored into monomial 
elementary matrices, where the nonzero off-diagonal entry has the form cz k for some 
non-negative integer k. So it suffices to show that such a monomial elementary matrix 
can be written as a product of causal small-stencil elementary matrices. 

If k = 0 , then we can do this using nearest-neighbor 'swaps' and one nearest- 
neighbor elementary matrix as in the section entitled "Small-Stencil Factorizations" the 
resulting factors are all constant and hence causal. For k = 1, note that the elementary 
matrix with upper right entry cz (i.e., adding cz times band n to band 1) is causal and 
small-stencil; by combining this with constant nearest-neighbor 'swaps' (moving the 
source band to band n and the destination band to band 1, without wrapping around), we 
can handle an elementary matrix which adds cz times band i to bandy. 
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Once we know how to add cz k times band i to band j for any i and j, we can add 
cz k+l times band i to band J for any i and j as follows: pick a band / different from i 
and j, and do: 

add z times band z to band / ; 

add cz k times band / to band j; 

add -z times band i to band / ; 

add - cz k times band / to band j. 
Therefore, by induction, we can handle any monomial elementary matrix. 
What if we want a causal small-stencil factorization which also preserves a one- 
band constant signal? Let us first assume we are using the flexible version of "small- 
stencil." The strong version of constant preservation where the constant signal must be 
held in band 1 only (i.e., matrices in the group Q) cannot be achieved here, because all 
causal small-stencil elementary matrices in Q have first column e h so any product of 
such matrices also has first column ej. 

However, if we put the constant signal in band n instead, then a factorization 
which is causal, small-stencil, and strongly constant-preserving can be attained. (It 
follows that, if the constant band is allowed to "roam," then the factorization can be 
achieved no matter which band initially contains the constant signal.) To see this, note 
that, using permissible factors, we can add c(z -l) times band i to band j if j < i, we can 

add c times band i to band /' if i < n, and we can 'swap' bands i and j if i, j < n. Next, we 
can add c(z - 1) times band i to bandy if i < j: if j < n, 'swap' bands i" and /, add -c{z -l) 

times band j to band i, and 'swap' bands j and i; if j - n, find j different from i and;', add 
c(z-l) times band i to band /, add band j to bandn, subtract c(z-l) times band/ 
from band /, and subtract band / from band«. Hence, using the recursive method 
from Proposition 14.2, we can add c(z -l)* times band i to band j for any distinct i and j, 
where k is any nonnegative integer if i < n and k is any positive integer if i = n. 

Now, given a causal matrix which preserves a constant in band n, we can factor it 
into causal elementary matrices preserving a constant in bandn by the methods of 
described in the section entitled "Factors Which Preserve Constant Signals." Such an 
elementary matrix adds p(z) times band i to band where the polynomial p(z) must be 
divisible by z - 1 if i = n. To factor this matrix further, just expand p (z) in powers of z - 
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1; each term in this expansion can be handled by the above, so, by (12.1), the whole 
elementary matrix can be factored into permissible factors. 

For the strict version of "small-stencil," we already know by the argument from 
the section entitled "Simultaneous Small-Stencil And Constant-Preserving Factors" that 
strong constant preservation cannot be achieved (this argument works no matter which 
band contains the constant). However, if we allow the constant to roam, then a causal 
small-stencil factorization is possible. For this, it suffices by the above to be able to add 
c(z-l) times the constant band to another band; this can be done by a version of the 
twelve-step method from the section entitled "Simultaneous Small-Stencil And Constant- 
Preserving Factors." Specifically, this method does not "wrap around"; instead, it handles 
the z part by moving the constant band to band n and the intermediate destination band 
" / " to band 1. 

EXAMPLE: THE 9-7 WAVELET 
As an example of the methods presented here, we consider a 9-7 wavelet which 
has been found to be well-suited for image compression and is in common use. 

The exact formulas for the filter coefficients for this wavelet are given in the FBI 
fingerprint compression standard. The coefficients are expressed in terms of x l , where 
J -14yT5+63 Y /3 ( -14yl5-63 T 3 _ 1_ 
1080715 J { 1080Vl5 J 6 

is the real root of the equation 

20*? + 10xf+4 Xl +1 = 0 
The referenced formulas also use a complex number x 2 , but they can be 
expressed in terms of x 1 using the formulas 

|x 2 | 2 =Jcf+x,/2+l/5and 
9U 2 =-V2-l/4, 
so x 2 is not needed. The filter coefficents then become: 
^(0)= -yflx, (240x? + 160^+83)/ 32 
= 0.8526986790094034 
\ (±1) = - yf2x, (l 60*? + 90*, + 37) / 32 
* 0.3774028556126538 



LIZA\18255AP.DOC 



-63- 



*b(±2)= -^(lOxf-3)/8 
—0.1106244044184234 
^(±3)= 572^(2^+1)/ 32 
—0.0238494650193800 
^(±4)= -5>/2^/64 

- 0.0378284555069955 
*k(-l)= V2(6jc,-1)/(16x,) 
« 0.7884856164056644 

ft,(-2) = h l (0) = -j2(l6x l -l)/(64x 1 ) 

= -0.4180922432222122 

*(-3) = ^(l)=^(2xi+l)/(32xi) 

— 0.0406894176095584 
\(-4)= h l (2)=-j2/(64x 1 ) 
« 0.0645388826289384 



The z matrix specified by these filter coefficients is 



a n (z) = h o (-4)z~ 2 +h 0 (-2)z- 1 +M0) + h 0 (2)z + K(4)z 2 

<h 2 (z)=K + K (-1) +h 0 (i)z+h 0 (3) z 2 

«2i (*) = \ (-4) z~ 2 + K (-2) z" 1 + ^ (0) + \ (2) z 
a 22 (z) = M-3)z- 1 +M-l) + h l (l)z 



It is already known to those of ordinary skill in the art how to factor M(z) into four 
elementary matrices and a constant diagonal matrix. In fact, these factors have the same 
symmetry as the matrix itself. The factors are also small-stencil; however, the integer 
approximation (of course, one has to factor the constant diagonal matrix further to get the 
integer approximation) does not preserve the constant signal. This is inevitable using 
symmetric factors, because requiring symmetry makes the factorization essentially 
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where 



unique. We will see that the use of asymmetric factors gives the extra freedom necessary 
for constant preservation while still using small-stencil factors. Since the given matrix is 
not causal, we do not need to look for causal factors. 

The determinant of M(z) is 1. However, M(z) does not send a constant signal with 
5 value & to a constant value k on band 1 (the low-pass filter) and zero on band 2 (the high- 
pass filter); it sends this constant signal to a constant *j2k on band 1 and zero on band 2. 
We therefore pull out a constant diagonal scaling matrix factor 



and work with the matrix S *M (z) from now on; for applications such as 
10 compression this scaling factor makes little difference anyway and is less important than 
constant preservation. 

Next, as specified earlier, we pull out a factor 



s from the right, leaving a matrix A(z) = S l M (z)A 1 satisfying A(l)e l = e 1 . We 

pj 15 will now work out a small-stencil constant-preserving factorization for A(z) (more 
efficient than the one described earlier in the section entitled Simultaneous Small-Stencil 



We start eliminating in the first column. First we do a row operation to eliminate 
the z 2 -term from a n . (Note that this must also eliminate the z 2 -term from a 12 , 

20 because otherwise the determinant of the new matrix would have a z 3 - term .) We have 
an extra degree of freedom here, so we can eliminate the z -2 - term from a u (and the 
z~ l - term from a n ) at the same time; this is what the usual symmetric factorization 
process does. Next, we do a row operation to eliminate the z~ 2 - term from a 2l ; here 
there is no extra degree of freedom, and we have to break symmetry to maintain constant 

25 preservation. The third step eliminates one term from a n , and this must be the trailing 
term (the z' 1 - term) in order to make later factors small-stencil. The fourth operation 
eliminates the z-term from a 2l , and the fifth operation eliminates the z-term from Ojj . 
The remaining is a constant, and since the matrix is in the group Q , the constant must 
be 1. In fact, we find that the remaining matrix is elementary (unit lower triangular) and 





and Constant-Preserving Factors). 
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small-stencil. This remaining factor can be combined with the factor A , which is also 
unit lower triangular. This yields the factorization: 

S-'M (z) = U t {z)h (z)U 3 (z)L 4 (z)U 5 (z)h (z) 
where U, (z) is small-stencil unit upper triangular and L ; (z) is small-stencil 

lower triangular: 



1 r i z + s i 
0 1 

The coefficients in these factor matrices are: 
ri =5x?/2 
r 2 =(20^ 2 +3)/4 
r 3 =0 

s 3 = (-410^ -90*, +13)/ 110 
r 4 =-(40x 2 +5)/4 
r 5 = (-70* 2 + 45^ +2l)/55 

The numerical values of these coefficients are: 
r, = 0.2930671710299618 
r 2 = 1.3361343420599236 
r 3 = 0 

r 4 = 2.4222686841198471 
r. = -0.0475120919539189 



M0 = 



KZ +S, 1 



s 5 =0 
s 6 = r 6 -l 

s t ~ 0.2930671710299618 
s z =-1.3361343420599236 
s 3 =-0.0386222501060046 
s 4 = 2.4222686841198471 
s 5 = 0 

s 6 =-1.5861343420599236 



25 



r 6 = 0.5861343420599236 
Note that, for each i < 6 , there is a simple relation between r t and s t (or one of 
them is 0). This means that, in each case, the rounded value {r t a + s t b) for integer 

arguments a and b can be computed by integer additions or subtractions together with a 
single operation of the form c (re) , where c is an integer and r is r t or s t . If the latter 
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operation can be performed by lookup in a precomputed table, then floating-point 
arithmetic can be avoided altogether. 

Since this factorization is not symmetric, it has a mirror-image form which can be 
obtained by reversing the signal, applying the factorization as above, and reversing again. 
To do this algebraically, we replace z with r 1 and conjugate by 

note that this leaves M(z) unchanged. The effect on the factors is to simply 
interchange r t with s t for each i. 

We now perform some error analysis for this factorization, starting with the norm 
method. 

Computing the norm of an arbitrary z-transform matrix appears to be a messy 
nonlinear optimization problem, but for an elementary matrix it is feasible. Let p(z) be 

the nonzero off-diagonal entry of a 2 x 2 elementary matrix B. Let b be the absolute 
value of the constant term of p(z), and let a be the sum of the absolute values of the 
coefficients of the nonconstant terms of p(z) . Then j|JB|| is the maximum value of 
*Jx 2 +(y + bx + af 
for real numbers x and y such that x 2 + y 2 = 1 . 

In fact, the same formula works for the norm of an n x n elementary matrix with 
nonzero off-diagonal entry p(z). Here we need to maximize 

y]x 2 +(y + bx + af + xl+xl + --- + xl 

subject to the constraint x 2 + y 2 + x 2 + --- + x 2 =1. This is equivalent to 

maximizing 

x 2 + (y + bx + a) 2 + x%+--- + x 2 -(x 2 + y 2 + x 2 + --- + x 2 n ) = 2y(bx + a) + (bx + af under 

this same constraint. If we hold x fixed, then the new objective function is linear in y, so 
it is maximized at one of the two extreme values of y; these extreme values occur when 
x 3 =■■■ = x n =0 . This reduces the problem to the 2 x 2 case. 

Actually computing this maximum requires he solution of a quartic polynomial 
equation in general, so one will normally resort to numerical approximation. But there 
are some special cases where the answer is simpler: 

if £ = 0, then Wl = fl + 1 ; 
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if a = 0, then 



D 



|2(a 4 +5fl 2 +2+Wa 2 +3) 

«H J 7T3 



if a = fc, then 

For the matrices U t (z) and L t (z) , we have a = |/;| and & = \s } \ . Five of these six 
matrices fall under the special cases above; we handle the remaining matrix Zg (z) by 
numerical methods. The resulting matrix norms are: 

(z)\ = 1.4037242452456795 
\\L, (z)\\ « 3.1 167781378219282 
p 3 (z)\\ = 1.0194975674480953 
|L 4 « 5.1277219075789451 



O 

m \\U 5 (z)\\ - 1.0475120919539189 



||L 6 (z)\\ « 2.6101749209829466 



_ Now we can use (2.1) to compute error bounds: for the forward transform the 

IT! error bound is about 29.03464691 16757969; for the inverse transform (which can also be 

P 15 computed using norms because we are in the 2 x 2 case) we get a bound of 
S 39.6038983737180800. 

H These bounds can probably be improved by direct error analysis in the manner 

discussed earlier, but this would require analyzing a combination of 17 separate errors 
(which are probably not independent). Instead we go to empirical methods. A random 
20 sample of over 4.6 xlO 9 test cases (using random integers chosen uniformly from the 
interval [-2 16 , 2 16 -l]) yielded the following worst errors: 

Forward transform: error « 4. 1636582346765949 for input 
-2522 "\ f -6636^1 ( -3046 "| ( 6398 ^ f -6254^ 
-16164 )\ 658 J '[-14296 J \ 10921 J'[ 8138 J 
Inverse transform: error =4.0303761353834788 for input 
f 757 "I f -15135^ f -11480") ( 6895 "| f-10013"| 
[l0905 j'[ 11419 J'[ 511 J'[-1806J'[ 11732 J 

(One needs five successive input pairs of low -band, high-band entries to compute 
one output pair.) 
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One might expect the alternate mirror-image form of the factorization to have the 
same error bounds. However, the reflection also changes the pairing between the band-1 
entries and the band-2 entries. (When the input signal is split into length-2 vectors, each 
band-1 entry is paired with the entry that immediately follows it in band 2. After the 
reflection, these two entries will end up in separate vectors.) So there is no reason to 
expect the error bounds to be identical. In fact, testing yields inputs with errors slightly 
worse than those found for the unreflected factorization: 

Forward transform: error =4.2264010122204445 for input 
(12962^1 f-15095^ (-A21l\ f 12318^1 f -13212^1 
[l2976 )'[-13917 J '[-3962 J'[ 6625 J'[ -5853 J 
Inverse transform: error =4.1588504091004767 for input 

r-4703^1 f -12506^ f 13822 f 3251 ^ f -14943^ 
[-8068 J'[ -7893 J '[-6129 J '[-14093 J'[ -5253 J 
Of course, there are many possible factorizations other than these two; finding an 
optimal one appears to be quite a challenge. 

USING IIR POLYNOMIAL DIVISION 
As we have noted before, a factorization of a z-transform matrix into elementary 
factors may be much longer than a factorization of a constant matrix; in fact, there is no 
fixed bound on the length of the z-transform factorization. The main reason for this is 
that the entries in the elementary matrices are quotients. We can divide by an arbitrary 
nonzero number, but we cannot divide by an arbitrary nonzero polynomial because only 
Laurent polynomials are allowed as entries in factor matrices. We will now look at what 
happens if we relax this restriction. 
If 



has determinant 1 and we can divide by a 2l , then we can factor A into three 
25 elementary matrices as described in the section entitled "Preserving Particular Lattice 
Points: 



1 












[a* l] 




0 


1 


0 1 
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So when can we divide by a 21 ? Clearly we can if a 21 is a nonzero constant or 
monomial. In other cases the process will be IIR (infinite impulse response) rather than 
FIR (finite impulse response). 

For instance, suppose a 21 (z) = 1 - cz . Then y = 1 + cz + c 2 z 2 + ■■■, so adding, 

say, / times the second band to the first band would involve combining entries from 

arbitrarily far in the past from the second band to update the current entry in the first 
band. 

This also raises the issue of numerical stability. If |c| > 1 , then we are adding 

larger and larger multiples of older entries from the second band to the first band, and the 
process is quite unstable and leads to an unbounded result. If |c| < 1 , though, then the 

process is stable and the result of applying it to a bounded signal is still a bounded signal. 
If p(z) is a nonzero Laurent polynomial whose zeros (other than 0) all have 

absolute value greater than 1, then p can be written as a product of a monomial and some 
number of factors of the form 1 - cz with |c| < 1 . Since we can divide by each of these 

factors stably in turn, we can divide by p stably. 

In fact, the process of dividing byp is just the standard long-division algorithm 
for (Laurent) polynonials, starting at the low-degree end. There is no need to factor p into 
linear factors and divide by them separately; the result is the same if one just performs the 
long division by p directly. This also means that one does not have to remember the 
entire past signal during the long division process; one only has to remember the current 
partial remainder, which is no longer than p. 

Let us now return to the polynomial 1 - cz , and assume this time that |c| < 1 , so 

we cannot simply perform the division as above. What we can do instead is rewrite 
1 - cz as -cz (l - c~ x z~ l ) . The monomial -cz causes no difficulty, and, since \c\ < 1 , the 

expression y, _ x _, x = 1 + c"V + c" 2 z' 2 + •■■ has decreasing coefficients and leads to 
/{l-c z ) 

a stable division algorithm. 

This corresponds to simply doing the long division in the opposite direction, 
starting from the high end (which is what one commonly does with polynomial division 
anyway). Again one can handle multiple factors of this form with a single long division. 
Of course, we are giving up on causality here. 
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In general, suppose a 21 is a Laurent polynomial which does not have any 
(complex) zeros of absolute value 1 . Then we can factor a 2X into two parts, one having 

the zeros of absolute value greater than 1 and the other having the zeros of absolute value 
less than 1. This lets us express a 21 in the form m(z) p(z)q[z' 1 ) , where m(z) is a 

monomial and p and q are polynomials (ordinary, not Laurent) with constant term 1 
whose zeros are all of absolute value greater than 1. Dividing by m(z) is easy; we divide 
by p(z) and q(z~ l ) successively using IIR long division, with the first division 

proceeding from low to high degree and the second division proceeding from high to low 
degree. 

There are some special cases of interest. If a 21 is a symmetric Laurent 
polynomial (i.e., a 2X (z) = a 2X ^z ')) which has no zeros of absolute value 1, then the 

polynomials p and q are actually equal; we can absorb the monomial into the other factors 
and write a 2x in the form p(z)p[z^) for some polynomial p (although this will require 

complex coefficients if a 21 (l) < 0 ). Another common situation is for the the Laurent 
polynomial a 21 (z) to have one dominant coefficient whose absolute value is greater than 
the sum of the absolute values of all of the other coefficients. In this case, it is impossible 
for a 2X to have a complex zero of absolute value 1, so we definitely can factor a 2l as 
above for division purposes. 

Finally, what if a 2l has a complex zero of absolute value 1? The simplest case is 
a 21 = z - 1 . If a constant signal x is multiplied by a 21 , the result will be zero no matter 
what the constant is. Hence, given a 21 x, one cannot recover x; in other words, division 
by a 2x is not even well-defined. The same thing happens for any polynomial a 2l with a 
zero w such that |w| = l: there is a nonzero bounded signal x = ^...w 2 ,w,l,w -1 ,w -2 ,...) 
such that a 21 x is zero, so it is impossible to divide by a 21 . (If a 2l has real coefficients but 

w is not real, then one can take the real part of the signal x above to get a real nonzero 
signal annihilated by a 21 .) 

We have shown here that, in many cases, it is possible to use IIR integer- 
approximable factors for a FIR linear transformation, and that this may require fewer 
factors than a FIR factorization (thus possibly giving faster computation and lower error 
bounds). The main drawback of this method is that it requires processing the entire signal 
even if one only needs part of the output (say, a subinterval of the transformed signal). 
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The need for this is frequent enough that it is usually worthwhile to use FIR factors even 
when more of them are required. 

CONCLUSION 

We have shown that a z-transform matrix (for a perfect reconstruction FIR signal 
transformation) of determinant 1 can be factored into elementary matrices (liftings) in a 
variety of ways; this allows us to find integer approximations to these factors (and hence 
to the original transformation) with additional useful properties, such as locality of 
interaction, causality, and/or preservation of constant integer signals. 

Just as in the fixed-length case, there are a number of possibilities here that 
remain to be explored, including additional factorizations of matrices and improved error 
analysis. Also, additional study would be helpful for the case of more than two bands (as 
mentioned earlier, one can use unit triangular matrices instead of elementary matrices and 
thus reduce the number of factors; algorithms for producing efficient factorizations into 
such factors would be quite useful) and for multidimensional signals (where one cannot 
always factor a transformation into elementary matrices, and even when one can, such as 
when there are more than two bands, the number of such matrices may be excessive). 
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